Preface



This volume contains papers presented at the 17th Annual Conference on Lear- 
ning Theory (previously known as the Conference on Computational Learning 
Theory) held in Banff, Canada from July 1 to 4, 2004. 

The technical program contained 43 papers selected from 107 submissions, 3 
open problems selected from among 6 contributed, and 3 invited lectures. The 
invited lectures were given by Michael Kearns on ‘Game Theory, Automated 
Trading and Social Networks’, Moses Charikar on ‘Algorithmic Aspects of Fi- 
nite Metric Spaces’, and Stephen Boyd on ‘Convex Optimization, Semidefinite 
Programming, and Recent Applications’. These papers were not included in this 
volume. 

The Mark Fulk Award is presented annually for the best paper co-authored 
by a student. This year the Mark Fulk award was supplemented with two further 
awards funded by the Machine Learning Journal and the National Information 
Communication Technology Centre, Australia (NICTA). We were therefore able 
to select three student papers for prizes. The students selected were Magalie Fro- 
mont for the single-author paper “Model Selection by Bootstrap Penalization for 
Classification” , Daniel Reidenbach for the single-author paper “On the Learna- 
bility of E-Pattern Languages over Small Alphabets” , and Ran Gilad-Bachrach 
for the paper “Bayes and Tukey Meet at the Center Point” (co-authored with 
Amir Navot and Naftali Tishby). 

This year saw an exceptional number of papers submitted to COLT cover- 
ing a wider range of topics than has previously been the norm. This exciting 
expansion of learning theory analysis to new models and tasks marks an im- 
portant development in the growth of the area as well as in the linking with 
practical applications. The large number of quality submissions placed a heavy 
burden on the program committee of the conference: Shai Ben-David (Cornell 
University), Stephane Boucheron (Universite Paris-Sud), Olivier Bousquet (Max 
Planck Institute), Sanjoy Dasgupta (University of California, San Diego), Vic- 
tor Dalmau (Universitat Pompeu Fabra), Andre Elisseeff (IBM Zurich Research 
Lab), Thore Graepel (Microsoft Research Labs, Cambridge), Peter Grunwald 
(CWI, Amsterdam), Michael Jordan (University of California, Berkeley), Adam 
Kalai (Toyota Technological Institute), David McAllester (Toyota Technological 
Institute), Manfred Opper (University of Southampton), Alon Orlitsky (Univer- 
sity of California, San Diego), Rob Schapire (Princeton University), Matthias 
Seeger (University of California, Berkeley), Satinder Singh (University of Michi- 
gan), Eiji Takimoto (Tohoku University), Nicolas Vayatis (Universite Paris 6), 
Bin Yu (University of California, Berkeley) and Thomas Zeugmann (University 
at Lübeck). We are extremely grateful for their careful and thorough reviewing
and for the detailed discussions that ensured the very high quality of the final 
program. We would like to have mentioned the subreviewers who assisted the 
program committee in reaching their assessments, but unfortunately space
constraints do not permit us to include this long list of names and we must simply
ask them to accept our thanks anonymously. 

We particularly thank Rob Holte and Dale Schuurmans, the conference local 
chairs, as well as the registration chair Kiri Wagstaff. Together they handled 
the conference publicity and all the local arrangements to ensure a successful 
event. We would also like to thank Microsoft for providing the software used in 
the program committee deliberations, and Ofer Dekel for maintaining this soft- 
ware and the conference Web site. Bob Williamson and Jyrki Kivinen assisted 
the organization of the conference in their role as consecutive Presidents of the 
Association of Computational Learning, and heads of the COLT Steering Com- 
mittee. We would also like to thank the ICML organizers for ensuring a smooth 
co-location of the two conferences and arranging for a ‘kernel day’ at the overlap 
on July 4. The papers appearing as part of this event comprise the last set of 8 
full-length papers in this volume. 

Finally, we would like to thank the Machine Learning Journal, the Pacific 
Institute for the Mathematical Sciences (PIMS), INTEL, SUN, the Informatics 
Circle of Research Excellence (iCORE), and the National Information Com- 
munication Technology Centre, Australia (NICTA) for their sponsorship of the 
conference. This work was also supported in part by the IST Programme of the
European Community, under the PASCAL Network of Excellence, IST-2002- 
506778. 



April, 2004 John Shawe-Taylor, 

Yoram Singer 
Program Co-chairs, COLT 2004 
Abstract. Communication complexity has recently been recognized as
a major obstacle in the implementation of combinatorial auctions. In this 
paper, we consider a setting in which the auctioneer (elicitor), instead of 
passively waiting for the bids presented by the bidders, elicits the bidders’ 
preferences (or valuations) by asking value queries. It is known that in 
the more general case (no restrictions on the bidders’ preferences) this 
approach requires the exchange of an exponential amount of information. 
However, in practical economic scenarios we might expect that bidders’ 
valuations are somewhat structured. In this paper, we consider several 
such scenarios, and we show that polynomial elicitation in these cases is 
often sufficient. We also prove that the family of “easy to elicit” classes of 
valuations is closed under union. This suggests that efficient preference 
elicitation is possible in a scenario in which the elicitor, contrary to what
is commonly assumed in the literature on preference elicitation, does
not exactly know the class to which the function to elicit belongs. Finally, 
we discuss what renders a certain class of valuations “easy to elicit with 
value queries”. 



1 Introduction 

Combinatorial auctions (CAs) have recently emerged as a possible mechanism to
improve economic efficiency when many items are on sale. In a CA, bidders can
present bids on bundles of items, and thus may easily express complementarities
(i.e., the bidder values two items together more than the sum of the valuations
of the single items), and substitutabilities (i.e., the two items together are worth
less than the sum of the valuations of the single items) between the objects
on sale¹. CAs can be applied, for instance, to sell spectrum licenses, pollution
permits, land lots, and so on [9].

This work is supported in part by NSF under CAREER Award IRI-9703122, Grant
IIS-9800994, ITR IIS-0081246, and ITR IIS-0121678.

This work was done when the author was visiting the Dept. of Computer Science,
Carnegie Mellon University.

The implementation of CAs poses several challenges, including computing 
the optimal allocation of the items (also known as the winner determination 
problem), and efficiently communicating bidders’ preferences to the auctioneer. 

Historically, the first problem that has been addressed in the literature is 
winner determination. In [16], it is shown that solving the winner determination 
problem is NP-hard; even worse, finding an $n^{1-\epsilon}$-approximation (here, n is the
number of bidders) to the optimal solution is NP-hard [18]. Despite these impos- 
sibility results, recent research has shown that in many scenarios the average-case 
performance of both exact and approximate winner determination algorithms is 
very good [4,13,17,18,22]. This is mainly due to the fact that, in practice, bidders’ 
preferences (and, thus, bids) are somewhat structured, where the bid structure 
is usually induced by the economic scenario considered. 

The communication complexity of CAs has been addressed only more re- 
cently. In particular, preference elicitation, where the auctioneer is enhanced by 
elicitor software that incrementally elicits the bidders’ preferences using queries, 
has recently been proposed to reduce the communication burden. Elicitation
algorithms based on different types of queries (e.g., rank, order, or value queries)
have been proposed [6,7,12]. Unfortunately, a recent result by Nisan and Segal 
[15] shows that elicitation algorithms in the worst case have no hope of consid- 
erably reducing the communication complexity, because computing the optimal 
allocation requires the exchange of an exponential amount of information be- 
tween the elicitor and the bidders. Indeed, the authors prove an even stronger 
negative result: obtaining a better approximation of the optimal allocation than 
that generated by auctioning off all objects as a bundle requires the exchange 
of an exponential amount of information. Thus, the communication burden pro- 
duced by any combinatorial auction design that aims at producing a non-trivial 
approximation of the optimal allocation is overwhelming, unless the bidders’ val- 
uation functions display some structure. This is a far worse scenario than that 
occurring in single item auctions, where a good approximation to the optimal 
solution can be found by exchanging a very limited amount of information [3].

For this reason, elicitation in restricted classes of valuation functions has been 
studied [2,8,15,21]. The goal is to identify classes of valuation functions that are
general (in the sense that they can express super-additivity, sub-additivity, or
both, between items) and can be elicited in polynomial time.

Preference elicitation in CAs has recently attracted significant interest from 
machine learning theorists in general [6,21], and at COLT in particular [2]. 

1.1 Full Elicitation with Value Queries 

In this paper, we consider a setting in which the elicitor’s goal is full elicitation,
i.e., learning the entire valuation function of all the bidders. This definition
should be contrasted with the other definition of preference elicitation, in which
the elicitor’s goal is to elicit enough information from the bidders so that the
optimal allocation can be computed. In this paper, we call this type of elicitation
partial elicitation. Note that, contrary to the case of partial elicitation, in full
elicitation we can restrict attention to learning the valuation of a single bidder.

¹ In this paper, we also use the terms super- and sub-additivity to refer to
complementarities and substitutabilities, respectively.

One motivation for studying full elicitation is that, once the full valuation 
functions of all the bidders are known to the auctioneer, the VCG payments [5,11, 
20] can be computed without further message exchange. Since VCG payments 
prevent strategic bidding behavior [14], the communication complexity of full 
preference elicitation is an upper bound on the communication complexity of
truthful mechanisms for combinatorial auctions. 

In this paper, we focus our attention on a restricted case of full preference 
elicitation, in which the elicitor can ask only value queries (what is the value of 
a particular bundle?) to the bidders. Our interest in value queries is due to the 
fact that, from the bidders’ point of view, these queries are very intuitive and 
easy to understand. Furthermore, value queries are in general easier to answer 
than, for instance, demand (given certain prices for the items, which would be 
your preferred bundle?) or rank (which is your most valuable bundle?) queries. 

Full preference elicitation with value queries has been investigated in a few re- 
cent papers. In [21], Zinkevich et al. introduce two classes of valuation functions 
(read-once formulas and ToolboxDNF formulas) that can be elicited with a poly- 
nomial number of value queries. Read-once formulas can express both sub- and 
super-additivity between objects, while ToolboxDNF formulas can only express 
super-additive valuations. In [8], we have introduced another class of “easy to
elicit with value queries” functions, namely k-wise dependent valuations. Functions
in this class can display both sub- and super-additivity, and in general are
not monotone² (i.e., they can express costly disposal).

1.2 Our Contribution 

The contributions of this paper can be summarized as follows: 

• We introduce the hypercube representation of a valuation function, which 
makes the contribution of every sub-bundle to the valuation of a certain bundle 
S explicit. This representation is a very powerful tool in the analysis of structural 
properties of valuations. 

• We study several classes of “easy to elicit with value queries” valuations. 
Besides considering the classes already introduced in the literature, we introduce 
several new classes of polynomially elicitable valuations. 

• We show that the family of “easy to elicit” classes of valuations is closed
under union. More formally, we prove that, if $C_1$ and $C_2$ are classes of valuations
elicitable asking at most $p_1(m)$ and $p_2(m)$ queries, respectively, then any function
in $C_1 \cup C_2$ is elicitable asking at most $p_1(m) + p_2(m) + 1$ queries. Furthermore,
we prove that this bound cannot be improved.

• The algorithm used to elicit valuations in $C_1 \cup C_2$ might have super-polynomial
running time (but asks only polynomially many queries). The question
of whether a general polynomial time elicitation algorithm exists remains
open. However, we present a polynomial time elicitation algorithm which, given
any valuation function $f$ in $RO_{+M} \cup Tool_{-t} \cup Tool_t \cup G_2 \cup INT$ (see Section
3 for the definition of the various classes of valuations), learns $f$ correctly. This
is an improvement over existing results, in which the elicitor is assumed to know
exactly the class to which the valuation function belongs.

• In the last part of the paper, we discuss what renders a certain class of valuations
“easy to elicit” with value queries. We introduce the concept of strongly
non-inferable set of a class of valuations, and we prove that if this set has
super-polynomial size then efficient elicitation is not possible. On the other hand, even
classes of valuations with empty strongly non-inferable set can be hard to elicit.
Furthermore, we introduce the concept of non-deterministic poly-query elicitation,
and we prove that a class of valuations is non-deterministically poly-query
elicitable if and only if its teaching dimension is polynomial.

² A valuation function $f$ is monotone if $f(S) \geq f(S')$ for any $S' \subseteq S$. This property
is also known as free disposal, meaning that bidders that receive extra items incur no
cost for disposing of them.

Overall, our results seem to indicate that, despite the impossibility result of 
[15], efficient and truthful CA mechanisms are a realistic goal in many economic 
scenarios. In such scenarios, elicitation can be done using only a simple and very 
intuitive kind of query, namely the value query.

2 Preliminaries 

Let $I$ denote the set of items on sale (also called the grand bundle), with $|I| = m$.
A valuation function on $I$ (valuation for short) is a function $f : 2^I \to \mathbb{R}^+$ that
assigns to any bundle $S \subseteq I$ its valuation. A valuation is linear, denoted $f_l$, if
$f_l(S) = \sum_{a \in S} f_l(a)$. To make the notation less cumbersome, we will use a, b, ...
to denote singletons, ab, bc, ... to denote two-item bundles, and so on.

Given any bundle $S$, $q(S)$ denotes the value query corresponding to $S$. In this
paper, value queries are the only type of queries the elicitor can ask the bidder 
in order to learn her preferences. Unless otherwise stated, in the following by 
“query” we mean “value query”. 

Definition 1 (PQE). A class of valuations $C$ is said to be poly-query (fully)
elicitable if there exists an elicitation algorithm which, given as input a description
of $C$, and by asking value queries only, learns any valuation $f \in C$ asking
at most $p(m)$ queries, for some polynomial $p(m)$. PQE is the set of all classes
$C$ that are poly-query elicitable.

The definition above is concerned only with the number of queries asked 
(communication complexity). Below, we define a stronger notion of efficiency, 
accounting for the computational complexity of the elicitation algorithm. 

Definition 2 (PTE). A class of valuations $C$ is said to be poly-time (fully)
elicitable if there exists an elicitation algorithm which, given as input a description
of $C$, and by asking value queries only, learns any valuation $f \in C$ in
polynomial time. PTE is the set of all classes $C$ that are poly-time elicitable.

It is clear that poly-time elicitability implies poly-query elicitability.

Throughout this paper, we will make extensive use of the following representation
of valuation functions. We build the undirected graph $H_I$ by introducing a
node for any subset of $I$ (including the empty set), and an edge between any two
nodes $S_1, S_2$ such that $S_1 \subset S_2$ and $|S_2| = |S_1| + 1$ (or vice versa). It is immediate
that $H_I$, which represents the lattice of the inclusion relationship between
subsets of $I$, is a binary hypercube of dimension $m$. Nodes in $H_I$ can be partitioned
into levels according to the cardinality of the corresponding subset: level
0 contains the empty set, level 1 the $m$ singletons, level 2 the $m(m-1)/2$ subsets of
two items, and so on.

The valuation function $f$ can be represented using $H_I$ by assigning a weight
to each node of $H_I$ as follows. We assign weight 0 to the empty set³, and weight
$f(a)$ to any singleton $a$. Let us now consider a node at level 2, say node $ab$⁴.
The weight of the node is $f(ab) - (f(a) + f(b))$. At the general step $i$, we assign
to node $S_i$, with $|S_i| = i$, the weight $f(S_i) - \sum_{S \subset S_i} w(S)$, where $w(S)$ denotes
the weight of the node corresponding to subset $S$. We call this representation of
$f$ the hypercube representation of $f$, denoted $H_I(f)$.

The hypercube representation of a valuation function makes explicit the
fact that, under the common assumption of no externalities⁵, the bidder’s valuation
of a bundle $S$ depends only on the valuation of all the singletons $a \in S$, and
on the relationships between all possible sub-bundles included in $S$. In general,
an arbitrary sub-bundle of $S$ may show positive or negative interactions between
the components, or may show no influence on the valuation of $S$. In the hypercube
representation, the contribution of any such sub-bundle to the valuation
of $S$ is isolated, and associated as a weight to the corresponding node in $H_I$.

Given the hypercube representation $H_I(f)$ of $f$, the valuation of any bundle
$S$ can be obtained by summing up the weights of all the nodes $S'$ in $H_I(f)$ such
that $S' \subseteq S$. These are the only weights contained in the sub-hypercube of $H_I(f)$
“rooted” at $S$.
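As an illustration, the weight assignment and the reconstruction of a valuation from its weights can be sketched in Python. This is a minimal sketch rather than code from the paper: the function names and the three-item valuation `EXAMPLE` are hypothetical, and the enumeration is exponential in the number of items, so it is only meant for tiny instances.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset, smallest subsets first."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def hypercube_weights(items, f):
    """Assign w(S) = f(S) - sum of w(S') over all proper subsets S' of S."""
    weights = {}
    for s in subsets(items):
        # Every proper subset of s was visited earlier, so its weight is known.
        weights[s] = f(s) - sum(w for t, w in weights.items() if t < s)
    return weights


def value_from_weights(weights, bundle):
    """Recover f(S) by summing the weights of all nodes contained in S."""
    bundle = frozenset(bundle)
    return sum(w for node, w in weights.items() if node <= bundle)


# Hypothetical normalized valuation on items a, b, c (not taken from the paper).
EXAMPLE = {
    frozenset(): 0,
    frozenset("a"): 4, frozenset("b"): 1, frozenset("c"): 2,
    frozenset("ab"): 7, frozenset("ac"): 5, frozenset("bc"): 3,
    frozenset("abc"): 9,
}

weights = hypercube_weights("abc", EXAMPLE.__getitem__)
```

In this toy valuation the node ab receives weight 2 (a super-additive pair), ac receives weight -1 (a sub-additive pair), and bc receives weight 0; summing the weights below any bundle returns its original value.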

Proposition 1. Any valuation function f admits a hypercube representation, 
and this representation is unique. 

Proof. For the proof of this proposition, as well as for the proofs of all the
other theorems presented in this work, see the full version of the paper [19].

Given Proposition 1, the problem of learning $f$ can be equivalently restated
as the problem of learning all the weights in $H_I(f)$. In this paper, we will often
state the elicitation problem in terms of learning the weights in $H_I(f)$, rather
than the value of bundles.

³ That is, we assume that the valuation function is normalized.

⁴ Slightly abusing the notation, we denote with $ab$ both the bundle composed of the
two items $a$ and $b$, and the corresponding node in $H_I$.

⁵ With no externalities, we mean here that the bidder’s valuation depends only on
the set of items $S$ that she wins, and not on the identity of the bidders who get the
items not in $S$.
Since the number of nodes in $H_I$ is exponential in $m$, the hypercube representation
of $f$ is not compact, and cannot be used directly to elicit $f$. However,
this representation is a powerful tool in the analysis of structural properties of
valuation functions.



3 Classes of Valuations in PTE 

In this section, we consider several classes of valuation functions that can be 
elicited in polynomial time using value queries. 

3.1 Read-Once Formulas

The class of valuation functions that can be expressed as read-once formulas, 
which we denote RO, has been introduced in [21]. A read-once formula is a 
function that can be represented as a “reverse” tree, where the root is the output, 
the leaves are the inputs (corresponding to items), and internal nodes are gates. 
The leaf nodes are labeled with a real-valued multiplier. The gates can be of
the following types: SUM, $MAX_c$, and $ATLEAST_c$. The SUM operator simply
sums the values of its inputs; the $MAX_c$ operator returns the sum of the $c$
highest inputs; the $ATLEAST_c$ operator returns the sum of its inputs if at least
$c$ of them are non-zero, and otherwise returns 0. In [21], it is proved that read-once
formulas are in PTE.
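The gate semantics described above can be made concrete with a small evaluator. The sketch below is an assumption-laden illustration, not the representation used in [21]: the nested-tuple encoding, the function name `evaluate`, and the example formula `FORMULA` are all hypothetical.

```python
def evaluate(node, bundle):
    """Evaluate a read-once formula, given as nested tuples, on a set of items.

    Hypothetical encoding:
      ("LEAF", item, multiplier)  -> multiplier if item is in the bundle, else 0
      ("SUM", children)           -> sum of the children's values
      ("MAX", c, children)        -> sum of the c highest children's values
      ("ATLEAST", c, children)    -> sum of the values if at least c are non-zero
    """
    kind = node[0]
    if kind == "LEAF":
        _, item, multiplier = node
        return multiplier if item in bundle else 0
    if kind == "SUM":
        return sum(evaluate(child, bundle) for child in node[1])
    _, c, children = node
    values = [evaluate(child, bundle) for child in children]
    if kind == "MAX":
        return sum(sorted(values, reverse=True)[:c])
    if kind == "ATLEAST":
        nonzero = sum(1 for v in values if v != 0)
        return sum(values) if nonzero >= c else 0
    raise ValueError(f"unknown gate type: {kind}")


# Each item labels exactly one leaf, which is what makes the formula read-once.
FORMULA = ("SUM", [
    ("MAX", 1, [("LEAF", "a", 5), ("LEAF", "b", 3)]),      # a and b are substitutes
    ("ATLEAST", 2, [("LEAF", "c", 2), ("LEAF", "d", 4)]),  # c and d are complements
])

value_ab = evaluate(FORMULA, {"a", "b"})
value_c = evaluate(FORMULA, {"c"})
value_all = evaluate(FORMULA, {"a", "b", "c", "d"})
```

With this formula, the bundle {a, b} is worth only the larger of the two leaves, item c alone is worth nothing, and the grand bundle collects both gates.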

In general, valuation functions in RO can express both complementarities
(through the $ATLEAST_c$ operator) and substitutabilities (through the $MAX_c$
operator) between items. If we restrict our attention to the class of read-once
formulas that can use only SUM and MAX operators (here, MAX is a shortcut
for $MAX_1$), then only sub-additive valuations can be expressed. This restricted
class of read-once formulas is denoted $RO_{+M}$ in the following.



3.2 k-wise Dependent Valuations

The class of k-wise dependent valuations, which we denote $G_k$, has been defined
and analyzed in [8]. k-wise dependent valuations are defined as follows:

Definition 3. A valuation function $f$ is k-wise dependent if the only mutual
interactions between items are on sets of cardinality at most $k$, for some constant
$k > 0$. In other words, the $G_k$ class corresponds to all valuation functions $f$ such
that the weights associated to nodes at level $i$ in $H_I(f)$ are zero whenever $i > k$.

Note that functions in $G_k$ might display both sub- and super-additivity between
items. Furthermore, contrary to most of the classes of valuation functions
described so far, k-wise dependent valuations might display costly disposal.

In [8], it is shown that valuations in $G_k$ can be elicited in polynomial time
asking $O(m^k)$ value queries.
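One way to see where a bound of this order comes from is to read Definition 3 literally: query every bundle of at most k items and compute the hypercube weights level by level. The sketch below does exactly that; it is an assumption about how such an elicitor could look, not necessarily the algorithm of [8], and the names `elicit_k_wise`, `HIDDEN` and `hidden_value` are hypothetical.

```python
from itertools import combinations
from math import comb


def elicit_k_wise(items, k, ask):
    """Learn a k-wise dependent valuation by querying all bundles of size <= k.

    `ask(bundle)` plays the role of a value query. Returns the learned
    valuation (a callable) and the number of queries asked.
    """
    items = sorted(items)
    weights = {frozenset(): 0}  # normalized valuation: the empty set weighs 0
    queries = 0
    for size in range(1, k + 1):
        for combo in combinations(items, size):
            s = frozenset(combo)
            value = ask(s)
            queries += 1
            # Subtract the already-known weights of all proper subsets of s.
            weights[s] = value - sum(w for t, w in weights.items() if t < s)

    def learned(bundle):
        bundle = frozenset(bundle)
        # Weights above level k are zero by assumption, so this sum is exact.
        return sum(w for t, w in weights.items() if t <= bundle)

    return learned, queries


# Hypothetical 2-wise dependent valuation on four items, given by its weights.
HIDDEN = {
    frozenset("a"): 3, frozenset("b"): 2, frozenset("c"): 1, frozenset("d"): 4,
    frozenset("ab"): 2, frozenset("cd"): -1,
}


def hidden_value(bundle):
    bundle = frozenset(bundle)
    return sum(w for node, w in HIDDEN.items() if node <= bundle)


learned, queries = elicit_k_wise("abcd", 2, hidden_value)
expected_queries = comb(4, 1) + comb(4, 2)
```

The query count is the number of non-empty bundles with at most k items, which is polynomial in m for constant k.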
3.3 The $Tool_t$ Class

The class of ToolboxDNF formulas, which we denote $Tool_t$, has been introduced
in [21], and is defined as follows:

Definition 4. A function $f$ is in $Tool_t$, where $t$ is polynomial in $m$, if it can be
represented by a polynomial $p$ composed of $t$ monomials (minterms), where each
monomial is positive.

For instance, polynomial $p = 3a + 4ab + 2bc + cd$ corresponds to the function
which gives value 3 to item $a$, 0 to item $b$, value 9 to the bundle $abc$, and so on.
Note that if $f \in Tool_t$, the only non-zero weights in $H_I(f)$ are those associated to
the minterms of $f$.
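The polynomial reading of a valuation can be checked mechanically: a bundle is worth the sum of the coefficients of the monomials it fully contains. The following is a minimal sketch; the dictionary encoding and the name `poly_value` are hypothetical, and items are assumed to have single-character names.

```python
def poly_value(poly, bundle):
    """Sum the coefficients of all monomials whose items all lie in the bundle."""
    bundle = set(bundle)
    return sum(coef for mono, coef in poly.items() if set(mono) <= bundle)


# The polynomial p = 3a + 4ab + 2bc + cd from the text, one entry per minterm.
P = {"a": 3, "ab": 4, "bc": 2, "cd": 1}

value_a = poly_value(P, "a")
value_b = poly_value(P, "b")
value_abc = poly_value(P, "abc")
```

The three values reproduce the ones quoted in the text (3, 0 and 9).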

ToolboxDNF valuations can express only substitutability-free valuations⁶,
and can be elicited in polynomial time asking $O(mt)$ value queries [21].

3.4 The $Tool_{-t}$ Class

This class of valuation functions is a variation of the ToolboxDNF class intro- 
duced in [21]. The class is defined as follows. 

Definition 5. $Tool_{-t}$ is the class of all the valuation functions $f$ such that
exactly $t$ of the weights in $H_I(f)$ are non-zero, where $t$ is polynomial in $m$. Of
these weights, only those associated to singletons can be positive. The bundles
associated to non-zero weights in $H_I(f)$ are called the minterms of $f$.

In other words, the $Tool_{-t}$ class corresponds to all valuation functions that
can be expressed using a polynomial $p$ with $t$ monomials (minterms), where
the only monomials with positive sign consist of a single literal. For
instance, the function $f$ defined by $p = 10a + 15b + 3c - 2ab - 3bc$ gives value 10 to
item $a$, value 23 to the bundle $ab$, and so on.

Theorem 1. If $f \in Tool_{-t}$, where $t$ is polynomial in $m$, then it can be elicited
in polynomial time by asking $O(mt)$ queries.

3.5 Interval Valuation Functions 

The class of interval valuations is inspired by the notion of interval bids [16,17], 
which have important economic applications. The class is defined as follows. The 
items on sale are ordered according to a linear order, and they can display super- 
additive valuations when bundled together only when the bundle corresponds 
to an interval in this order. We call this class of substitutability-free valuations
INTERVAL, and we denote the set of all valuations in this class as INT. 

An example of a valuation in INT is the following: there are three items on
sale, $a$, $b$ and $c$, and the linear order is $a < b < c$. We have $f(a) = 10$, $f(b) = 5$,
$f(c) = 3$, $f(ab) = 17$, $f(bc) = 10$, $f(ac) = f(a) + f(c) = 13$ (because bundle $ac$
is not an interval in the linear order), and $f(abc) = 21$.

⁶ A valuation function $f$ is substitutability-free if and only if, for any $S_1, S_2 \subseteq I$, we
have $f(S_1) + f(S_2) \leq f(S_1 \cup S_2)$.

The INT class displays several similarities with the $Tool_t$ class: there are
a number of basic bundles (minterms) with non-zero value, and the value of a
set of items depends on the value of the bundles that the bidder can form with
them. However, the two classes turn out to be not comparable with respect to
inclusion, i.e. there exist valuation functions $f, f'$ such that $f \in Tool_t - INT$
and $f' \in INT - Tool_t$. For instance, the valuation function corresponding to
the polynomial $p = a + b + c + ab + bc + ac$ is in $Tool_t - INT$, since objects can be
bundled “cyclically”. On the other hand, the valuation function $f$ of the example
above cannot be expressed using a ToolboxDNF function. In fact, the values of the
bundles $a$, $b$, $c$, $ab$, $bc$ and $ac$ give the polynomial $p' = 10a + 5b + 3c + 2ab + 2bc$.
In order to get the value 21 for the bundle $abc$, which clearly includes all the
sub-bundles in $p'$, we must add the term $abc$ to $p'$ with negative weight $-1$. Since
only positive terms are allowed in $Tool_t$, it follows that $f \in INT - Tool_t$.
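Both halves of this argument can be checked numerically. The sketch below recomputes the hypercube weights of the INT example and then tests, by brute force over all orderings, whether the multi-item minterms of the "cyclic" polynomial can all be intervals of a single linear order. The helper names are hypothetical, and the interval test encodes only the necessary condition suggested by the text (every multi-item minterm must be an interval of the order), which is an assumption about the class rather than its full definition.

```python
from itertools import permutations


def hypercube_weights(table):
    """Weights w(S) = f(S) - sum of w(S') over proper subsets, from a value table."""
    weights = {}
    for s in sorted(table, key=len):  # smaller bundles first
        weights[s] = table[s] - sum(w for t, w in weights.items() if t < s)
    return weights


def is_interval(bundle, order):
    """True if the items of `bundle` occupy consecutive positions in `order`."""
    positions = sorted(order.index(x) for x in bundle)
    return positions[-1] - positions[0] + 1 == len(positions)


def some_order_makes_intervals(items, bundles):
    """True if some linear order of `items` turns every bundle into an interval."""
    return any(
        all(is_interval(b, order) for b in bundles)
        for order in permutations(items)
    )


# The INT example from the text, with linear order a < b < c.
F = {
    frozenset(): 0,
    frozenset("a"): 10, frozenset("b"): 5, frozenset("c"): 3,
    frozenset("ab"): 17, frozenset("bc"): 10, frozenset("ac"): 13,
    frozenset("abc"): 21,
}

weights = hypercube_weights(F)
cyclic_fits_an_order = some_order_makes_intervals("abc", ["ab", "bc", "ac"])
```

The weight of abc comes out negative, so no polynomial with only positive monomials represents this valuation, while no ordering of three items makes ab, bc and ac intervals at the same time.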

What about preference elicitation with value queries in case $f \in INT$? It
turns out that the efficiency of elicitation depends on what the elicitor knows 
about the linear ordering of the objects. We distinguish three scenarios: 

a) the elicitor knows the linear ordering of the items; 

b) the elicitor does not know the linear ordering of the items, but the valuation
function $f$ to be elicited is such that $f(ab) > f(a) + f(b)$ if and only if $a$
and $b$ are immediate neighbors in the ordering;

c) the elicitor does not know the linear ordering of the items, and the valuation
function to be elicited is such that $f(ab) = f(a) + f(b)$ does not imply that
$a$ and $b$ are not immediate neighbors in the ordering. For instance, we could have
$a < b < c$, $f(ab) > f(a) + f(b)$, $f(bc) = f(b) + f(c)$, and $f(abc) > f(ab) + f(c)$
(i.e., the weight of $abc$ in $H_I(f)$ is greater than zero).

The following theorem shows that poly-time elicitation is feasible in scenarios
a) and b). Determining elicitation complexity under scenario c) remains open.



Theorem 2. If $f \in INT$, then:

- Scenario a): it can be elicited in polynomial time asking value queries;

- Scenario b): it can be elicited in polynomial time asking at most $m^2 - m + 1$
value queries.
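For scenario a), one natural strategy is to query each interval of the known order exactly once. The sketch below illustrates this under an explicit assumption that is not stated as such in the text: the value of a non-interval bundle is the sum of the values of its maximal intervals (as in $f(ac) = f(a) + f(c)$ in the example), so that only interval nodes carry non-zero hypercube weight. The name `elicit_interval` and the oracle `ask` are hypothetical; the count of $m(m+1)/2$ queries is simply the number of intervals and is not claimed to be the bound proved in [19].

```python
def elicit_interval(order, ask):
    """Learn an interval valuation when the linear order of the items is known.

    `order` lists the items in their linear order; `ask(bundle)` is a value
    query. Only intervals are queried, shortest first.
    """
    m = len(order)
    weights = {}
    queries = 0
    for length in range(1, m + 1):
        for start in range(m - length + 1):
            interval = frozenset(order[start:start + length])
            value = ask(interval)
            queries += 1
            # Assumption: non-interval bundles have zero weight, so only the
            # weights of the shorter intervals inside this one are subtracted.
            weights[interval] = value - sum(
                w for t, w in weights.items() if t < interval
            )

    def learned(bundle):
        bundle = frozenset(bundle)
        return sum(w for t, w in weights.items() if t <= bundle)

    return learned, queries


# Interval values of the example in the text (order a < b < c). The bundle ac
# is deliberately missing: the elicitor must never ask about it.
INTERVAL_VALUES = {"a": 10, "b": 5, "c": 3, "ab": 17, "bc": 10, "abc": 21}


def ask(bundle):
    return INTERVAL_VALUES["".join(sorted(bundle))]


learned, queries = elicit_interval("abc", ask)
```

On the three-item example this asks six queries and still predicts the value of the non-interval bundle ac correctly.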



3.6 Tree Valuation Functions 

A natural way to extend the INT class is to consider those valuation functions 
in which the relationships between the objects on sale have a tree structure. 
Unfortunately, it turns out that the valuation functions that belong to this class, 
which we denote TREE, are not poly-query elicitable even if the structure of 
the tree is known to the elicitor. 
Theorem 3. There exists a valuation function $f \in TREE$ that can be learned
correctly only by asking at least $2^{m/2}$ value queries, even if the elicitor knows the
structure of the tree.

However, if we restrict the super-additive valuations to be only on subtrees 
of the tree T that describes the item relationships, rather than on arbitrary 
connected subgraphs of T, then polynomial time elicitation with value queries 
is possible (given that T itself can be learned in polytime using value queries). 

Theorem 4. Assume that the valuation function $f \in TREE$ is such that super-additive
valuations are only displayed between objects that form a subtree of $T$,
and assume that the elicitor can learn $T$ asking a polynomial number of value
queries. Then, $f$ can be elicited asking a polynomial number of value queries.

4 Generalized Preference Elicitation 

In the previous section we have considered several classes of valuation functions, 
proving that most of them are in PTE. However, the definition of PTE (and of 
PQE) assumes that the elicitor has access to a description of the class of the 
valuation to elicit; in other words, the elicitor a priori knows the class to which 
the valuation function belongs. In this section, we analyze preference elicitation 
under a more general framework, in which the elicitor has some uncertainty 
about the actual class to which the valuation to elicit belongs. 

We start by showing that the family of poly-query elicitable classes of valu- 
ations is closed under union. 

Theorem 5. Let $C_1$ and $C_2$ be two classes of poly-query elicitable valuations,
and assume that $p_1(m)$ (resp., $p_2(m)$) is a polynomial such that any valuation
in $C_1$ (resp., $C_2$) can be elicited asking at most $p_1(m)$ (resp., $p_2(m)$) queries.
Then, any valuation in $C_1 \cup C_2$ can be elicited asking at most $p_1(m) + p_2(m) + 1$
queries.
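The proof of Theorem 5 is deferred to [19], but a bound of this shape is what one gets from a simple strategy: run both class-specific elicitors, and if their hypotheses disagree, spend one extra query on a bundle where they differ. The sketch below illustrates that idea only; it is not claimed to be the algorithm used in the proof. It assumes each elicitor always stops within its query bound and returns some hypothesis even when the true valuation lies outside its class, and both toy classes (linear valuations, and valuations that pay only for the grand bundle) are hypothetical. The exhaustive search for a distinguishing bundle takes exponential time, although it asks no queries.

```python
from itertools import combinations


def all_bundles(items):
    """Yield every subset of `items` as a frozenset, smallest first."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def elicit_linear(items, ask):
    """Hypothetical elicitor for linear valuations: one query per singleton."""
    single = {a: ask(frozenset([a])) for a in items}
    return lambda bundle: sum(single[a] for a in bundle)


def elicit_grand_only(items, ask):
    """Hypothetical elicitor for valuations worth v on the grand bundle, else 0."""
    grand = frozenset(items)
    v = ask(grand)
    return lambda bundle: v if frozenset(bundle) == grand else 0


def elicit_union(items, ask, algo1, algo2):
    """Elicit a valuation known to lie in the union of the two classes."""
    h1 = algo1(items, ask)
    h2 = algo2(items, ask)
    for s in all_bundles(items):  # exhaustive search, no queries asked here
        if h1(s) != h2(s):
            # The true valuation equals h1 or h2, so one extra query decides.
            return h1 if ask(s) == h1(s) else h2
    return h1  # the two hypotheses coincide on every bundle


def counting_oracle(f):
    """Wrap a valuation so that the number of value queries can be inspected."""
    count = [0]

    def ask(bundle):
        count[0] += 1
        return f(frozenset(bundle))

    return ask, count


ITEMS = "abc"


def true_valuation(bundle):
    return 7 if bundle == frozenset(ITEMS) else 0


ask, count = counting_oracle(true_valuation)
learned = elicit_union(ITEMS, ask, elicit_linear, elicit_grand_only)
```

Here the linear elicitor uses 3 queries, the grand-bundle elicitor uses 1, and a single tie-breaking query brings the total to 3 + 1 + 1.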

In the following theorem, we prove that the bound on the number of queries
needed to elicit a function in $C_1 \cup C_2$ stated in Theorem 5 is tight.

Theorem 6. There exist families of valuation functions $C_1, C_2$ such that either
$C_i$ can be elicited asking at most $m - 1$ queries, but $C_1 \cup C_2$ cannot be elicited
asking fewer than $2m - 1 = 2(m - 1) + 1$ queries (in the worst case).

Theorem 5 shows that, as far as communication complexity is concerned, efficient
elicitation can be implemented under a very general scenario: if the only
information available to the elicitor is that $f \in C_1 \cup \cdots \cup C_{q(m)}$, where the $C_i$
are in PQE and $q(m)$ is an arbitrary polynomial, then elicitation can be done
with polynomially many queries. This is a notable improvement over traditional
elicitation techniques, in which it is assumed that the elicitor knows exactly the
class to which the function to elicit belongs.
Although interesting, Theorem 5 leaves open the question of the computational
complexity of the elicitation process. In fact, the general elicitation algorithm
$A_{1 \cup 2}$ used in the proof of the theorem (see the full version of the paper
[19]) has running time which is super-polynomial in $m$. So, a natural question to
ask is the following: let $C_1$ and $C_2$ be poly-time elicitable classes of valuations;
is the $C_1 \cup C_2$ class elicitable in polynomial time?

Even if we do not know the answer to this question in general, in the following
we show that, at least for many of the classes considered in this paper,
the answer is yes. In particular, we present a polynomial time algorithm that
elicits correctly any function $f \in RO_{+M} \cup Tool_{-t} \cup Tool_t \cup G_2 \cup INT$. The
algorithm is called GenPolyLearn, and is based on a set of theorems which
show that, given any $f \in C_1 \cup C_2$, where $C_1, C_2$ are any two of the classes
listed above, $f$ can be learned correctly with a low-order polynomial bound on
the runtime (see [19]).

The algorithm, which is reported in Figure 1, is very simple: initially, the
hypothesis set $H_p$ contains all the five classes. After asking the value of every
singleton, GenPolyLearn asks the value of every two-item bundle and, based
on the corresponding weights in $H_I(f)$, discards some of the hypotheses. When
the hypothesis set contains at most two classes, the algorithm continues preference
elicitation accordingly. In case $H_p$ contains more than two classes after
all the two-item bundles have been elicited, one more value query (on the grand
bundle) is sufficient for the elicitor to resolve uncertainty, reducing the size of
the hypothesis set to at most two. The following theorem shows the correctness
of GenPolyLearn, and gives a bound on its runtime.

Theorem 7. Algorithm GenPolyLearn learns correctly in polynomial time
any valuation function in $\mathrm{RO}_{+M} \cup \mathrm{Tool}_{-t} \cup \mathrm{Tool}_t \cup G_2 \cup \mathrm{INT}$ asking at
most $O(m(m + 1))$ value queries.

From the bidders’ side, a positive feature of GenPolyLearn is that it asks
queries that are relatively easy to answer: valuations of singletons, two-item
bundles, and the grand bundle. (In many cases, the overall value of the market
considered, e.g., all the spectrum frequencies in the US, is publicly available information.)
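
The pruning loop of Figure 1 (steps 4-12) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the pair weight is taken to be $w(ab) = f(ab) - f(a) - f(b)$, the class labels are plain strings, and `order_compatible` is a hypothetical placeholder for the linear-order test used for INT, whose details are given only in [19].

```python
from itertools import combinations

# String labels for the five classes (illustrative names only).
ALL_CLASSES = frozenset({"RO+M", "Tool-t", "Toolt", "G2", "INT"})


def prune_hypotheses(items, value, order_compatible=lambda a, b, w: True):
    """Sketch of the hypothesis-pruning phase of GenPolyLearn.

    value(bundle) answers a value query on a frozenset of items.
    order_compatible is a hypothetical stand-in for the INT test of [19].
    Returns (surviving classes, number of value queries asked).
    """
    single = {a: value(frozenset([a])) for a in items}  # singleton queries
    queries = len(single)
    hp = set(ALL_CLASSES)
    for a, b in combinations(items, 2):
        if len(hp) <= 2:
            break
        # ASSUMPTION: weight of a pair = its value minus the singleton values.
        w = value(frozenset([a, b])) - single[a] - single[b]
        queries += 1
        if w < 0:
            hp -= {"Toolt", "INT"}
            if w != -min(single[a], single[b]):
                hp.discard("RO+M")
        if w > 0:
            hp -= {"RO+M", "Tool-t"}
        if not order_compatible(a, b, w):
            hp.discard("INT")
    return hp, queries
```

In this sketch the pruning phase asks at most $m + m(m-1)/2 = m(m+1)/2$ value queries, and the case analysis of Figure 1 adds one more query on the grand bundle, which is consistent with the $O(m(m+1))$ bound of Theorem 7 for this phase.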

5 Towards Characterizing Poly-query Elicitation 

In the previous sections we have presented several classes of valuation functions 
that can be elicited asking polynomially many queries, and we have proved that 
efficient elicitation can be implemented in a quite general setting. In this section,
we discuss the properties that these classes have in common, thus making a step 
forward in the characterization of what renders a class of valuations easy to elicit 
with value queries. 

Let $C$ be a class of valuations, $f$ any valuation in $C$, and $A_C$ an elicitation
algorithm for $C$. Let $Q$ be an arbitrary set of value queries, representing the

In the following, we assume that the elicitation algorithm is a “smart” algorithm for 
C, i.e. an algorithm which is able to infer the largest amount of knowledge from the 
answers to the queries asked so far. 
Algorithm GenPolyLearn:

0. $H_p = \{\mathrm{RO}_{+M}, G_2, \mathrm{Tool}_t, \mathrm{Tool}_{-t}, \mathrm{INT}\}$
1. build the first level of $H_I(f)$ asking the value of singletons
2. build the second level of $H_I(f)$ asking the value of two-item bundles in arbitrary order
3. let $w(ab)$ be the computed weight for bundle $ab$
4. repeat
5.     if $w(ab) < 0$ then
6.         remove $\mathrm{Tool}_t$ and INT from $H_p$
7.         if $w(ab) \neq -\min\{f(a), f(b)\}$ then remove $\mathrm{RO}_{+M}$ from $H_p$
8.     if $w(ab) > 0$ then
9.         remove $\mathrm{RO}_{+M}$ and $\mathrm{Tool}_{-t}$ from $H_p$
10.    if $w(ab)$ is not compatible with the linear order discovered so far then
11.        remove INT from $H_p$
12. until $|H_p| \leq 2$ or all the $w(ab)$ have been considered
13. if $|H_p| \leq 2$ then continue elicitation as described in Theorems 6-15 of [19]; otherwise:
14. case 1: all the $w(ab)$ weights are $\geq 0$ and compatible with the linear order, and at least one weight is positive
15.     ask the value of the grand bundle $I$
16.     if $f(I) = \sum_{S \subseteq I, |S| \leq 2} w(S)$ then
17.         remove $\mathrm{Tool}_t$ from $H_p$
18.         continue elicitation as in the proof of Th. 9 of [19]
19.     else
20.         remove $G_2$ from $H_p$
21.         continue elicitation as in the proof of Th. 10 of [19]
22. case 2: all the $w(ab)$ weights are $\leq 0$, at least one weight is negative, and $\mathrm{RO}_{+M} \in H_p$
23.     ask the value of the grand bundle $I$
24.     if $f(I) \neq \sum_{S \subseteq I, |S| \leq 2} w(S)$ then
25.         remove $G_2$ from $H_p$
26.         continue elicitation as in the proof of Th. 15 of [19]
27.     else
28.         remove $\mathrm{Tool}_{-t}$ from $H_p$
29.         continue elicitation as in the proof of Th. 6 of [19]
30. case 3: $w(ab) = 0$ for all $ab$
31.     ask the value of the grand bundle $I$
32.     if $f(I) < \sum_{a \in I} f(a)$ then
33.         remove INT, $\mathrm{Tool}_t$, $G_2$, $\mathrm{RO}_{+M}$ from $H_p$
34.         $f \in \mathrm{Tool}_{-t}$; continue elicitation accordingly
35.     else
36.         remove $\mathrm{Tool}_{-t}$, $G_2$, $\mathrm{RO}_{+M}$ from $H_p$
37.         proceed as in the proof of Th. 10 of [19]

Fig. 1. Algorithm for learning correctly any valuation function in $\mathrm{RO}_{+M} \cup \mathrm{Tool}_{-t} \cup \mathrm{Tool}_t \cup G_2 \cup \mathrm{INT}$ asking a polynomial number of value queries.
queries asked by $A_C$ at a certain stage of the elicitation process. Given the
answers to the queries in $Q$, which we denote $Q(f)$ ($f$ is the function to be
elicited), and a description of the class $C$, $A_C$ returns a set of learned values
$V_C(Q(f))$. This set obviously contains any $S$ such that $q(S) \in Q$; furthermore,
it may contain the value of other bundles (the inferred values), which are inferred
given the description of $C$ and the answers to the queries in $Q$. The elicitation
process ends when $V_C(Q(f)) = 2^I$.

Definition 6 (Inferability). Let $S$ be an arbitrary bundle, and let $f$ be any
function in $C$. The $f$-inferability of $S$ w.r.t. $C$ is defined as:

$$IN_{f,C}(S) = \min\{|Q| \text{ s.t. } (q(S) \notin Q) \text{ and } (S \in V_C(Q(f)))\}.$$

If the value of $S$ can be learned only by asking $q(S)$, we set $IN_{f,C}(S) = 2^m - 1$.
The inferability of $S$ w.r.t. $C$ is defined as:

$$IN_C(S) = \max_{f \in C} IN_{f,C}(S).$$

Intuitively, the inferability of a bundle measures how easy it is for an
elicitation algorithm to learn the value of $S$ without explicitly asking it.

Definition 7 (Polynomially-inferable bundle). A bundle $S$ is said to be
polynomially-inferable (inferable for short) w.r.t. $C$ if $IN_C(S) = p(m)$, for some
polynomial $p(m)$.

Definition 8 (Polynomially non-inferable bundle). A bundle $S$ is said to
be polynomially non-inferable (non-inferable for short) w.r.t. $C$ if $IN_C(S)$ is
super-polynomial in $m$.

Definition 9 (Strongly polynomially non-inferable bundle). A bundle $S$
is said to be strongly polynomially non-inferable (strongly non-inferable for short)
with respect to class $C$ if, $\forall f \in C$, $IN_{f,C}(S)$ is super-polynomial in $m$.

Note the difference between poly and strongly poly non-inferable bundles: in
the former case, there exists a function $f$ in $C$ such that, on input $f$, the value
of $S$ can be learned with polynomially many queries only by asking $q(S)$; in the
latter case, this property holds for all the valuations in $C$.

Definition 10 (Non-inferable set). Given a class of valuations $C$, the
non-inferable set of $C$, denoted $NI_C$, is the set of all bundles in $2^I$ that are
non-inferable w.r.t. $C$.

Definition 11 (Strongly non-inferable set). Given a class of valuations $C$,
the strongly non-inferable set of $C$, denoted $SNI_C$, is the set of all bundles in $2^I$ that are
strongly non-inferable w.r.t. $C$.



When clear from the context, we simply speak of inferability, instead of inferability w.r.t. $C$.
Clearly, we have $SNI_C \subseteq NI_C$. The following theorem shows that for some
class of valuations $C$ the inclusion is strict. Actually, the gap between the size
of $SNI_C$ and that of $NI_C$ can be super-polynomial in $m$.

The theorem uses a class of valuations introduced by Angluin [1] in the related
context of concept learning. The class, which we call RDNF (Restricted DNF)
since it is a subclass of DNF formulas, is defined as follows. There are $m = 2k$
items, for some $k > 0$. The items are arbitrarily partitioned into $k$ pairs, which
we denote $S_i$, with $i = 1, \ldots, k$. We also define a bundle $\bar{S}$ of cardinality $k$ such
that $\forall i$, $|S_i \cap \bar{S}| = 1$. In other words, $\bar{S}$ is an arbitrary bundle obtained by taking
exactly one element from each of the pairs. We call the $S_i$s and the bundle $\bar{S}$
the minterms of the valuation function $f$. The valuations in RDNF are defined
as follows: $f(S) = 1$ if $S$ contains one of the minterms; $f(S) = 0$ otherwise.

Theorem 8. We have $|SNI_{\mathrm{RDNF}}| = 0$, while $|NI_{\mathrm{RDNF}}|$ is super-polynomial
in $m$.

Proof. We first prove that $|SNI_{\mathrm{RDNF}}| = 0$. Let $f$ be any function in RDNF,
and let $S_1, \ldots, S_k, \bar{S}$ be its minterms. Let $S$ be an arbitrary bundle, and assume
that $S$ is not a minterm. Then, the value of $S$ can be inferred given the answers to
the queries $Q' = \{q(S_1), \ldots, q(S_k), q(\bar{S})\}$, which are polynomially many. Thus, $S$
is not in $SNI_{\mathrm{RDNF}}$. Since for any bundle $S$ there exists a function $f$ in RDNF
such that $S$ is not one of the minterms of $f$, we have that $SNI_{\mathrm{RDNF}}$ is empty.

Let us now consider $NI_{\mathrm{RDNF}}$. Let $S$ be an arbitrary bundle of cardinality $k$, and
let $f$ be a function in RDNF. If $S$ is one of the minterms of $f$ (i.e., $S = \bar{S}$) the
only possibility for the elicitor to infer its value is by asking the value of all the
other bundles of cardinality $k$ (there are super-polynomially many such bundles).
In fact, queries on bundles of cardinality $< k$ or $\geq k + 1$ give no information on
the identity of $\bar{S}$. So, $S$ is in $NI_{\mathrm{RDNF}}$. Since for any bundle $S$ of cardinality $k$
there exists a function $f$ in RDNF such that $S$ is a minterm of $f$, we have that
$NI_{\mathrm{RDNF}}$ contains super-polynomially many bundles.
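
The gap stated in Theorem 8 can be checked by brute force on a tiny instance. The sketch below is illustrative only and rests on assumptions that are ours: the pairs are fixed to $\{0,1\}, \{2,3\}, \ldots$ and known to the elicitor, so the class contains one valuation per choice of the special minterm; the helper names are hypothetical.

```python
from itertools import combinations, product


def all_bundles(m):
    """Every subset of the item set {0, ..., m-1}."""
    return [frozenset(c) for r in range(m + 1) for c in combinations(range(m), r)]


def rdnf_class(k):
    """RDNF valuations over m = 2k items with fixed pairs {0,1}, {2,3}, ...

    Returns a dict mapping each special minterm to its valuation table.
    """
    pairs = [frozenset({2 * i, 2 * i + 1}) for i in range(k)]
    bundles = all_bundles(2 * k)
    cls = {}
    for choice in product(*(sorted(p) for p in pairs)):
        special = frozenset(choice)  # one item from each pair
        minterms = pairs + [special]
        cls[special] = {S: int(any(t <= S for t in minterms)) for S in bundles}
    return cls


def inferability(f, S, cls, m):
    """Brute-force IN_{f,C}(S): fewest queries, none on S, that fix f(S)."""
    others = [T for T in all_bundles(m) if T != S]
    for size in range(len(others) + 1):
        for Q in combinations(others, size):
            consistent = [g for g in cls.values() if all(g[T] == f[T] for T in Q)]
            if all(g[S] == f[S] for g in consistent):
                return size
    return 2 ** m - 1  # convention of Definition 6


k = 2
cls = rdnf_class(k)
f = cls[frozenset({0, 2})]
hard = inferability(f, frozenset({0, 2}), cls, 2 * k)  # S is the special minterm
easy = inferability(f, frozenset({0, 3}), cls, 2 * k)  # S is not a minterm
print(hard, easy)
```

For $k = 2$ this prints `3 1`: the special minterm needs $2^k - 1$ queries on the other candidate bundles, whereas a non-minterm is settled by the single query $q(\bar{S})$.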

The following theorem shows that whether a certain class $C$ is in PQE depends
to a certain extent on the size of $SNI_C$.

Theorem 9. Let $C$ be an arbitrary class of valuations. If the size of $SNI_C$ is
super-polynomial in $m$, then $C \notin \mathrm{PQE}$.

Theorem 9 states that a necessary condition for a class of valuations C to 
be easy to elicit is that its strongly non-inferable set has polynomial size. Is this 
condition also sufficient? The following theorem, whose proof follows immediately
from the fact that the RDNF class is hard to elicit with value queries [1],
gives a negative answer to this question, showing that even classes C with an 
empty strongly non-inferable set may be hard to elicit. 

Theorem 10. The condition $|SNI_C| = p(m)$ for some polynomial $p(m)$ is not
sufficient for making $C$ easy to elicit with value queries. In particular, we have
that $|SNI_{\mathrm{RDNF}}| = 0$, and $\mathrm{RDNF} \notin \mathrm{PQE}$.
Theorem 10 shows that the size of the strongly non-inferable set alone is not
sufficient to characterize classes of valuations which are easy to elicit. Curiously,
the size of the non-inferable set of RDNF is super-polynomial in $m$. Thus, the
following question remains open: “Does there exist a class of valuations $C$ such
that $|NI_C| = p(m)$ for some polynomial $p(m)$ and $C \notin \mathrm{PQE}$?” or, equivalently,
“Is the condition $|NI_C| = p(m)$ for some polynomial $p(m)$ sufficient for making
$C$ poly-query elicitable?”

Furthermore, Theorem 10 suggests the definition of another notion of poly-query
elicitation, which we call “non-deterministic poly-query elicitation” and
denote with NPQE. Let us consider the RDNF class used in the proof of
Theorem 8. In a certain sense, this class seems easier to elicit than a class $C$ with
$|SNI_C|$ super-polynomial in $m$. In case of the class $C$, any set of polynomially
many queries is not sufficient to learn the function (no “poly-query certificate”
exists). Conversely, in case of RDNF such a “poly-query certificate” exists for any
$f \in \mathrm{RDNF}$ (it is the set $Q'$ as defined in the proof of Theorem 8); what makes
elicitation hard in this case is the fact that this certificate is “hard to guess”.
So, the RDNF class is easy to elicit if non-deterministic elicitation is allowed.
The following definition captures this concept:

Definition 12 (NPQE). A class of valuations $C$ is said to be poly-query
non-deterministic (fully) elicitable if there exists a nondeterministic elicitation
algorithm which, given as input a description of $C$, and by asking value queries only,
learns any valuation $f \in C$ asking at most $p(m)$ queries in at least one of the
nondeterministic computations, for some polynomial $p(m)$. NPQE is the set of
all classes $C$ that are poly-query nondeterministic elicitable.

It turns out that non-deterministic poly-query elicitation can be character- 
ized using a notion introduced in [10], which we adapt here to the framework of 
preference elicitation. 

Definition 13 (Teaching dimension). Let $C$ be a class of valuations, and let
$f$ be an arbitrary function in $C$. A teaching set for $f$ w.r.t. $C$ is a set of queries
$Q$ such that $V_C(Q(f)) = 2^I$. The teaching dimension of $C$ is defined as

$$TD(C) = \max_{f \in C} \min \{ |Q| \text{ s.t. } (Q \subseteq 2^I) \text{ and } (Q \text{ is a teaching set for } f) \}.$$


Theorem 11. Let $C$ be an arbitrary class of valuations. $C \in \mathrm{NPQE}$ if and
only if $TD(C) = p(m)$ for some polynomial $p(m)$.

The following result follows directly by observing that RDNF is in NPQE
(it has $O(m)$ teaching dimension) but not in PQE:

Proposition 2. $\mathrm{PQE} \subset \mathrm{NPQE}$.
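
Definition 13 can be made concrete with a brute-force computation on a small class. The sketch below is an illustration under our own assumptions, not part of the paper: a class is a list of valuation tables, a set of queries is treated as a teaching set for $f$ when no other valuation in the class agrees with $f$ on all of them, and the RDNF instance uses fixed, known pairs $\{0,1\}, \{2,3\}, \ldots$.

```python
from itertools import combinations, product


def all_bundles(m):
    """Every subset of the item set {0, ..., m-1}."""
    return [frozenset(c) for r in range(m + 1) for c in combinations(range(m), r)]


def rdnf_class(k):
    """List of RDNF valuation tables over m = 2k items (fixed pairs assumed)."""
    pairs = [frozenset({2 * i, 2 * i + 1}) for i in range(k)]
    bundles = all_bundles(2 * k)
    tables = []
    for choice in product(*(sorted(p) for p in pairs)):
        minterms = pairs + [frozenset(choice)]
        tables.append({S: int(any(t <= S for t in minterms)) for S in bundles})
    return tables


def teaching_dimension(cls, bundles):
    """Max over f of the size of a smallest query set that singles out f."""
    def min_teaching_set(f):
        rivals = [g for g in cls if g is not f]
        for size in range(len(bundles) + 1):
            for Q in combinations(bundles, size):
                # Q teaches f if every rival disagrees with f somewhere on Q.
                if all(any(g[T] != f[T] for T in Q) for g in rivals):
                    return size
        raise ValueError("class contains duplicate valuations")
    return max(min_teaching_set(f) for f in cls)


td = teaching_dimension(rdnf_class(2), all_bundles(4))
print(td)
```

With the pairs fixed, a single query on the special minterm already separates $f$ from every other valuation, so the sketch reports a teaching dimension of 1 for $k = 2$. This is within the $O(m)$ bound quoted above, while a deterministic elicitor may still have to try the candidate minterms one by one.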
Abstract. We introduce a graph-theoretic generalization of classical Arrow-
Debreu economics, in which an undirected graph specifies which consumers or
economies are permitted to engage in direct trade, and the graph topology may 
give rise to local variations in the prices of commodities. Our main technical con- 
tributions are: (1) a general existence theorem for graphical equilibria, which 
require local markets to clear; (2) an improved algorithm for computing approx- 
imate equilibria in standard (non-graphical) economies, which generalizes the 
algorithm of Deng et al. [2002] to non-linear utility functions; (3) an algorithm 
for computing equilibria in the graphical setting, which runs in time polynomial 
in the number of consumers in the special but important case in which the graph 
is a tree (again permitting non-linear utility functions). We also highlight many 
interesting learning problems that arise in our model, and relate them to learning 
in standard game theory and economics, graphical games, and graphical models 
for probabilistic inference. 



1 Introduction 

Models for the exchange of goods and their prices in a large economy have a long 
and storied history within mathematical economics, dating back more than a century
to the work of Walras [1874] and Fisher [1891], and continuing through the model of 
Wald [1936] (see also Brainard and Scarf [2000]). A pinnacle of this line of work came in 
1954, when Arrow and Debreu provided extremely general conditions for the existence of
an equilibrium in such models (in which markets clear, i.e. supply balances demand, and 
all individual consumers and firms optimize their utility subject to budget constraints). 
Like Nash’s roughly contemporary proof of the existence of equilibria for normal-form 
games (Nash [1951]), Arrow and Debreu’s result placed a rich class of economic models 
on solid mathematical ground. 

These important results established the existence of various notions of equilibria. The 
computation of game-theoretic and economic equilibria has been a more slippery affair. 
Indeed, despite decades of effort, the computational complexity of computing a Nash 
equilibrium for a general-sum normal-form game remains unknown, with the best known 
algorithms requiring exponential time in the worst case. Even less is known regarding 
the computation of Arrow-Debreu equilibria. Only quite recently, a polynomial-time 
algorithm was discovered for the special but challenging case of linear utility func- 
tions (Devanur et al. [2002], Jain et al. [2003], Devanur and Vazirani [2003]). Still less 
is known about the learning of economic equilibria in a distributed, natural fashion. 



J. Shawe-Taylor and Y. Singer (Eds.): COLT 2004, LNAI 3120, pp. 17-32, 2004. 
© Springer-Verlag Berlin Heidelberg 2004
One promising direction for making computational progress is to introduce alternative
ways of representing these problems, with the hope that wide classes of “natural”
problems may permit special-purpose solutions. By developing new representations that 
permit the expression of common types of structure in games and economies, it may 
be possible to design algorithms that exploit this structure to yield computational as 
well as modeling benefits. Researchers in machine learning and artificial intelligence 
have proven especially adept at devising models that balance representational power 
with computational tractability and learnability, so it has been natural to turn to these 
literatures for inspiration in strategic and economic models. 

Among the most natural and common kinds of structure that arise in game-theoretic 
and economic settings are constraints and asymmetries in the interactions between the 
parties. By this we mean, for example, that in a large-population game, not all players 
may directly influence the payoffs of all others. The recently introduced formalism of 
graphical games captures this notion, representing a game by an undirected graph and a 
corresponding set of local game matrices (Kearns et al. [2001]). In Section 2 we briefly 
review the history of graphical games and similar models, and their connections with 
other topics in machine learning and probabilistic inference. 

In the same spirit, in this paper we introduce a new model called graphical economics 
and show that it provides representational and algorithmic benefits for Arrow-Debreu 
economics. Each vertex i in an undirected graph represents an individual party in a 
large economic system. The presence of an edge between i and j means that free trade 
is allowed between the two parties, while the absence of this edge means there is an 
embargo or other restriction on direct trade. The graph could thus represent a network of 
individual business people, with the edges indicating who knows whom; or the global 
economy, with the edges representing nation pairs with trade agreements; and many 
other settings. Since not all parties may directly engage in trade, the graphical economics 
model permits (and realizes) the emergence of local prices — that is, the price of the 
same good may vary across the economy. Indeed, one of our motivations in introducing 
the model is to capture the fact that price differences for identical goods can arise due 
to the network structure of economic interaction. 

We emphasize that the mere introduction of a network or graph structure into eco- 
nomic models is in itself not a new idea; while a detailed history of such models is beyond 
our scope, Jackson [2003] provides an excellent survey. However, to our knowledge, the 
great majority of these models are designed to model specific economic settings. Our 
model has deliberately incorporated a network model into the general Arrow-Debreu 
framework. Our motivation is to capture and understand network interactions in what is 
the most well-studied of mathematical economic models. 

The graphical economics model suggests a local notion of clearance, directly derived 
from that of the Arrow-Debreu model. Rather than asking that the entire (global) market 
clear in each good, we can ask for the stronger “provincial” conditions that the local 
market for each good must clear. For instance, the United States is less concerned that the 
worldwide production of beef balances worldwide demand than it is that the production 
of American beef balances worldwide demand for American beef. If this latter condition 
holds, the American beef industry is doing a good job at matching the global demand 
for their product, even if other countries suffer excess supply or demand. 
The primary contributions of this paper are:

- The introduction of the graphical economics model (which lies within the Arrow- 
Debreu framework) for capturing structured interaction between individuals, orga- 
nizations or nations. 

- A proof that under very general conditions (essentially analogous to Arrow and 
Debreu’s original conditions), graphical equilibria always exist. This proof requires 
a non-trivial modification to that of Arrow and Debreu. 

- An algorithm for computing approximate standard market equilibria in the non- 
graphical setting that runs in time polynomial in the number of players (fixing the 
number of goods) for a rather general class of non-linear utility functions. This result 
generalizes the algorithm of Deng et al. [2002] for linear utility functions. 

- An algorithm, called ADProp (for Arrow-Debreu Propagation) for computing ap- 
proximate graphical equilibria. This algorithm is a message-passing algorithm work- 
ing directly on the graph, in which neighboring consumers or economies exchange 
information about trade imbalances between them under potential equilibria prices. 
In the case that the graph is a tree, the running time of the algorithm is exponential 
in the graph degree and number of goods k, but only polynomial in the number of 
vertices n (consumers or economies). It thus represents dramatic savings over treat- 
ing the graphical case with a non-graphical algorithm, which results in a running 
time exponential in n (as well as in k). 

- A discussion of the many challenging learning problems that arise in both the tradi- 
tional and graphical economic models. This discussion is provided in Section 6. 



2 A Brief History of Graphical Games 

In this section, we review the short but active history of work in the model known as 
graphical games, and highlight connections to more longstanding topics in machine 
learning and graphical models. 

Graphical games were introduced in Kearns et al. [2001], where a representation 
consisting of an undirected graph and a set of local payoff matrices was proposed for 
multi-player games. The interpretation is that the payoff to player i is a function of the 
actions of only those players in the neighborhood of vertex i in the graph. Exactly as with 
the graphical models for probabilistic inference that inspired them (such as Bayesian 
and Markov networks), graphical games provide an exponentially more succinct repre- 
sentation in cases where the number of players is large, but the degree of the interaction 
graph is relatively small. 
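
To make the succinctness claim concrete, the following back-of-the-envelope count may help. It assumes, as our own simplification rather than anything stated in the text, that every player has two actions and that every vertex has at most $d$ neighbors besides itself.

```latex
\begin{align*}
\text{normal form:} \quad & n \cdot 2^{n} \ \text{payoff entries}, \\
\text{graphical game:} \quad & n \cdot 2^{d+1} \ \text{payoff entries}.
\end{align*}
% Example: n = 100 and d = 3 give 100 * 2^4 = 1600 entries,
% versus 100 * 2^100 (about 1.27e32) for the normal form.
```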

A series of papers by several authors established the computational benefits of this 
model. Kearns et al. [2001] gave a provably efficient (polynomial in the model size) 
algorithm for computing all approximate Nash equilibria in graphical games with a 
tree topology; this algorithm can be formally viewed as the analogue of the junction 
tree algorithm for inference in tree-structured Markov networks. A related algorithm 
described in Littman et al. [2002] computes a single but exact Nash equilibrium. 

In the same way that the junction tree and polytree algorithms for probabilistic
inference were generalized to obtain the more heuristic belief propagation
algorithm, Ortiz and Kearns [2003] proposed the NashProp algorithm for arbitrary
graphical games, proved its convergence, and experimentally demonstrated promising
performance on a wide class of graphs. Vickrey and Koller [2002] proposed and
experimentally compared a wide range of natural algorithms for computing equilibria in graphical
games, and quite recently Blum et al. [2003] developed an interesting new algorithm
based on continuation methods.

An intriguing connection between graphical games and Markov networks was es- 
tablished in Kakade et al. [2003], in the context of the generalization of Nash equilibria 
known as correlated equilibria. There it was shown that if G is the underlying graph of a 
graphical game, then all the correlated equilibria of the game (up to payoff equivalence) 
can be represented as a Markov network whose underlying graph is almost identical 
to G — in particular, only a small number of highly localized connections need to be 
added. This result establishes a natural and very direct relationship between the strategic 
structure of interaction in a multi-player game, and the probabilistic dependency struc- 
ture of any (correlated) equilibrium. In addition to allowing one to establish non-trivial 
independencies that must hold at equilibrium, this result is also thought-provoking from 
a learning perspective, since a series of recent papers has established that correlated 
equilibrium appears to be the natural convergence notion for a wide class of “rational” 
learning dynamics. We shall return to this topic when we discuss learning in Section 6. 



3 Graphical Economies 

The classical Arrow-Debreu (AD in the sequel) economy (without firms) consists of $n$
consumers who trade $k$ commodities of goods amongst themselves in an unrestricted
manner. In an AD economy, each unit of commodity $h \in \{1, \ldots, k\}$ can be bought
by any consumer at price $p_h$. We denote the vector of prices by $p \in \mathbb{R}_+^k$ (where
$\mathbb{R}_+ = \{x \geq 0\}$).

Each consumer $i$ purchases a consumption plan $x^i \in \mathbb{R}_+^k$, where $x_h^i$ is the amount
of commodity $h$ that is purchased by $i$. We assume that each consumer $i$ has an initial
endowment $e^i \in \mathbb{R}_+^k$ of the $k$ commodities, where $e_h^i$ is the amount of commodity $h$
initially held by $i$. These commodities can be sold to other consumers and thus provide
consumer $i$ with wealth or cash, which can in turn be used to purchase other goods.
Hence, if the initial endowment of consumer $i$ is completely sold, then the wealth of
consumer $i$ is $p \cdot e^i$. A consumption plan $x^i$ is budget constrained if $p \cdot x^i \leq p \cdot e^i$,
which implicitly assumes the endowment is completely sold (which in fact holds at
equilibrium).

Every consumer $i$ has a utility function $u_i : \mathbb{R}_+^k \to \mathbb{R}_+$, where $u_i(x^i)$ describes
how much utility consumer $i$ receives from consuming the plan $x^i$. The utility function
thus expresses the preferences a consumer has for varying bundles of the $k$ goods.

A graphical economy with $n$ players and $k$ goods can be formalized as a standard AD
economy with $nk$ “traditional” goods, which are indexed by the pairs $(i, h)$. The good
$(i, h)$ is interpreted as “good $h$ sold by consumer $i$”. The key restriction is that free trade
is not permitted between all consumers, so not every player may be able to purchase $(i, h)$.
It turns out that with these trade restrictions, we were not able to invoke the original
existence proof used in the standard Arrow-Debreu model, and we had to use some
interesting techniques to prove existence.
It is most natural to specify the trade restrictions through an undirected graph, $G$,
over the $n$ consumers.* The graph $G$ specifies how the consumers are allowed to trade
with each other: each consumer may have a limited choice of where to purchase
commodities. The interpretation of $G$ is that if $(i, j)$ is an edge in $G$, then free trade
exists between consumers $i$ and $j$, meaning that $i$ is allowed to buy commodities from
$j$ and vice-versa; while the lack of an edge between $i$ and $j$ means that no direct trade
is permitted. More precisely, if we use $N(i)$ to denote the neighbor set of $i$ (which by
convention includes $i$ itself), then consumer $i$ is free to buy any commodity, but only from
the consumers in $N(i)$. It will naturally turn out that rational consumers only
purchase goods from a neighbor with the best available price.

Associated with each consumer $i$ is a local price vector $p^i \in \mathbb{R}_+^k$, where $p_h^i$ is the
price at which commodity $h$ is being sold by $i$. We denote the set of all local price vectors
by $P = \{p^i : i = 1, \ldots, n\}$. Each consumer $i$ purchases an amount of commodities
$x^{ij} \in \mathbb{R}_+^k$, where $x_h^{ij}$ is the amount of commodity $h$ that is purchased from consumer
$j$ by consumer $i$. The trade restrictions imply that $x^{ij} = 0$ for $j \notin N(i)$. Here, the
consumption plan is the set $X^i = \{x^{ij} : j \in N(i)\}$ and an $X^i$ is budget constrained
if $\sum_{j \in N(i)} p^j \cdot x^{ij} \leq p^i \cdot e^i$, which again implicitly assumes the endowment is
completely sold (which holds at equilibrium).

In the graphical setting, we assume the utility function only depends on the total
amount of each commodity consumed, independent of whom it was purchased from. This
expresses the fact that the goods are identical across the economy, and consumers seek the
best prices available to them. Slightly abusing notation, we define $x^i = \sum_{j \in N(i)} x^{ij}$,
which is the total vector amount of goods consumed by $i$ under the plan $X^i$. The utility
of consumer $i$ is given by the function $u_i(x^i)$, which is a function from $\mathbb{R}_+^k$ to $\mathbb{R}_+$.

4 Graphical Equilibria 

In equilibrium, there are two properties which we desire to hold — consumer ratio- 
nality and market clearance. We now define these and state conditions under which an 
equilibrium is guaranteed. 

The economic motivation for a consumer in the choice of consumption plans is
to maximize utility subject to a budget constraint. We say that a consumer $i$ uses an
optimal plan at prices $P$ if the plan maximizes utility over the set of all plans which
are budget constrained under $P$. For instance, in the graphical setting, a plan $X^i$ for $i$
is optimal at prices $P$ if the plan $X^i$ maximizes the function $u_i$ over all $X'^i$ subject to
$\sum_{j \in N(i)} p^j \cdot x'^{ij} \leq p^i \cdot e^i$.

We say the market clears if the supply equals the demand. In the standard setting,
define the total demand vector as $d = \sum_i x^i$ and the total supply vector as $e = \sum_i e^i$,
and say the market clears if $d = e$. In the graphical setting, the concept of clearance
is applied to each “commodity $h$ sold by $i$”, so we have a local notion of clearance, in
which all the goods sold by each consumer clear in the neighborhood. Define the local

* Throughout the paper we describe the model and results in the setting where the graph constrains 
exchange between individual consumers, but everything generalizes to the case in which the 
vertices are themselves complete AD economies, and the graph is viewed as representing trade 
agreements. 
demand vector $d^i \in \mathbb{R}_+^k$ on consumer $i$ as $d^i = \sum_{j \in N(i)} x^{ji}$. The clearance condition
is, for each $i$, $d^i = e^i$.
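
A short derivation shows why local clearance is the stronger condition. It is a sketch under our reading of the definitions: the local demand on $i$ is the total amount its neighbors buy from it, $d^i = \sum_{j \in N(i)} x^{ji}$, and $x^j = \sum_{i \in N(j)} x^{ji}$ is the total consumption of $j$.

```latex
\begin{align*}
\sum_{i=1}^{n} d^i
  &= \sum_{i=1}^{n} \sum_{j \in N(i)} x^{ji} \\
  &= \sum_{j=1}^{n} \sum_{i \in N(j)} x^{ji}
     && \text{($G$ is undirected, so } j \in N(i) \iff i \in N(j)\text{)} \\
  &= \sum_{j=1}^{n} x^{j}.
\end{align*}
% Hence d^i = e^i for every i implies \sum_j x^j = \sum_i e^i,
% i.e. total demand equals total supply; the converse need not hold.
```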

A market or graphical equilibrium is a set of prices and plans in which all plans are 
optimal at the current prices and in which the market clears. We note that the notions of 
traditional AD and graphical equilibria coincide when the graph is fully connected. 

As with the original notion of AD equilibria, it is important to establish the general
existence of graphical equilibria. Also as with the original notion, in order to prove the
existence of equilibria, two natural technical assumptions are required, one on the utility
functions and the other on the endowments. We begin with the assumption on utilities.

Assumption I: For all consumers $i$, the utility function $u_i$ satisfies the following three
properties:

- (Continuity) $u_i$ is a continuous function.

- (Monotonicity) $u_i$ is strictly monotonically increasing with each commodity.

- (Quasi-Concavity) If $u_i(x') > u_i(x)$ then $u_i(\alpha x' + (1 - \alpha)x) > u_i(x)$ for all
$0 < \alpha < 1$.

The monotonicity assumption is somewhat stronger than the original “non-satiability” 
assumption made by AD, but is made primarily for expository purposes. Our results can 
be generalized to the original assumption as well. 

The following facts follow from Assumption I and the consumers’ rationality: 

1. At equilibrium, the budget constraint inequality for consumer $i$ is saturated; e.g., in
a standard AD economy, a consumer using an equilibrium plan $x^i$ spends all the
money obtained from the sale of the endowment $e^i$.

2. In any graphical equilibrium, a consumer only purchases a commodity at the cheap- 
est price among the neighboring consumers. Note that the neighboring consumer 
with the cheapest price may not be unique. 

Assumption II: (Non-Zero Endowments) For each consumer $i$ and good $h$, $e_h^i > 0$.

The seminal theorem of Arrow and Debreu [1954] states that these assumptions are 
sufficient to ensure existence of a market equilibrium. However, this theorem does not 
immediately imply existence of an equilibrium in a graphical economy, due to the re- 
stricted nature of trade. Essentially, Assumption II in the AD setting implies that each 
consumer owns a positive amount of every good in the economy. In the graphical setting, 
there are effectively nk goods, but each consumer only has an endowment in k of them. 
To put it another way, consumer $i$ may only obtain income from selling goods at the $k$
local prices $p^i$, and is not able to sell any of its endowment at prices $p^j$ for $j \neq i$.

Nevertheless, Assumptions I and II still turn out to be sufficient to allow us to prove 
the following graph-theoretic equilibrium existence theorem. 

Theorem 1. (Graphical Equilibria Existence) For any graphical economy in which 
Assumptions I and II hold, there exists a graphical equilibrium. 



Before proving existence, let us examine these equilibria with some examples. 

Fig. 1. Price variation and the exchange subgraph at graphical equilibrium in a preferential
attachment network. See text for description.



4.1 Local Price Variation at Graphical Equilibrium 

To illustrate the concept of graphical equilibrium and its difference from the traditional
AD notion, we now provide an example in which local price differences occur at
equilibrium. The economy consists of three consumers, $c_1$, $c_2$ and $c_3$, and two goods,
$g_1$ and $g_2$. The graph of the economy is the line $c_1$ - $c_2$ - $c_3$.

The utility functions for all three consumers are linear. Consumer $c_1$ has linear utility
for $g_1$ with coefficient 1, and zero utility for $g_2$. Consumer $c_2$ has linear utility for both
$g_1$ and $g_2$, with both coefficients 1. Consumer $c_3$ has zero utility for $g_1$, and linear utility
for $g_2$ with coefficient 1. The endowments $(e_1, e_2)$ for $g_1$ and $g_2$ for the consumers are
as follows: (1, 2) for $c_1$, (1, 1) for $c_2$, and (2, 1) for $c_3$.

We claim that the following local prices $(p_1, p_2)$ for $g_1$ and $g_2$ constitute a graphical
equilibrium: prices (2, 1) to purchase from $c_1$, (2, 2) to purchase from $c_2$, and (1, 2) to
purchase from $c_3$. It can also be shown that there is no graphical equilibrium in which
the prices for both goods are the same from all consumers, so price variations are essential
for equilibrium. We leave the verification of these claims as an exercise for the interested
reader.
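
The first claim can be checked mechanically. The following is a minimal sketch in Python,
not part of the original paper. It assumes that each consumer counts as its own neighbor
(so a consumer may keep part of its own endowment by buying it at its own price), and the
purchase plans in `buy` are one allocation worked out by hand for illustration; they are
not stated in the text. All function names are hypothetical.

```python
from fractions import Fraction as F

# Line graph c1 - c2 - c3; every consumer is also its own neighbor (assumption).
neighbors = {1: [1, 2], 2: [1, 2, 3], 3: [2, 3]}
endow = {1: (1, 2), 2: (1, 1), 3: (2, 1)}   # endowments of (g1, g2)
price = {1: (2, 1), 2: (2, 2), 3: (1, 2)}   # local prices charged by each seller
coef = {1: (1, 0), 2: (1, 1), 3: (0, 1)}    # linear utility coefficients

# Hand-derived candidate plans (assumption): buy[i][j] is the amount of
# (g1, g2) that consumer i purchases from neighbor j.
buy = {
    1: {1: (1, 0), 2: (1, 0)},
    2: {1: (0, 2), 3: (2, 0)},
    3: {2: (0, 1), 3: (0, 1)},
}

def wealth(i):
    return sum(p * e for p, e in zip(price[i], endow[i]))

def spending(i):
    return sum(price[j][h] * q[h] for j, q in buy[i].items() for h in range(2))

def utility(i):
    return sum(coef[i][h] * q[h] for q in buy[i].values() for h in range(2))

def best_utility(i):
    # With linear utility, the optimum spends all wealth at the best
    # utility-per-unit-of-money ratio offered by any neighbor.
    ratio = max(F(coef[i][h], price[j][h]) for j in neighbors[i] for h in range(2))
    return ratio * wealth(i)

def demand_on(j):
    # Total amount of each good purchased from seller j.
    return tuple(sum(buy[i].get(j, (0, 0))[h] for i in buy) for h in range(2))

def is_graphical_equilibrium():
    for i in (1, 2, 3):
        if any(j not in neighbors[i] for j in buy[i]):
            return False   # trade is only allowed along edges
        if spending(i) > wealth(i) or utility(i) != best_utility(i):
            return False   # over budget, or not utility-maximizing
        if demand_on(i) != endow[i]:
            return False   # the local market at i does not clear
    return True

print(is_graphical_equilibrium())
```

Under these assumptions every consumer has wealth 4, spends all of it, and each local
market clears exactly. The second claim (that no uniform-price equilibrium exists) is not
checked by this sketch.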

Essentially, in this example, $c_1$ and $c_3$ would like to exchange goods, but the graphical
structure prohibits direct trade. Consumer $c_2$, however, is indifferent to the two goods,
and thus acts as a kind of arbitrage agent, selling each of $c_1$ and $c_3$ their desired good at
a high price, while buying their undesired good at a low price.

A more elaborate and interesting equilibrium computation which also contains price 
variation is shown in Figure 1. In this graph, there are 20 buyers and 20 sellers (labeled
by ‘B’ or ‘S’ respectively, followed by an index). The bipartite connectivity structure (in 
which edges are only between buyers and sellers) was generated according to a statistical 
model known as preferential attachment (Barabasi and Albert [1999]), which accounts 
for the heavy-tailed distribution of degrees often found in real social and economic 
networks. All buyers have a single unit of currency and utility only for an abstract good, 
while all sellers have a single unit of this good and utility only for currency. Each seller 
vertex is labeled with the price they charge at graphical equilibrium. Note that in this 
example, there is non-trivial price variation, with the most fortunate sellers charging 
1.50 for the unit of the good, and the least fortunate 0.67. 

The black edges in the figure show the exchange subgraph — those pairs of buyers 
and sellers who actually exchange currency and goods at equilibrium. Note the sparseness 
of this graph compared to the overall graph. The yellow edges (the most faint in a black 
and white version) are edges of the original graph that are unused at equilibrium because 
they represent inferior prices for the buyers, while the dashed edges are edges of the 
original graph that have competitive prices, but are unused at equilibrium due to the local 
market clearance conditions. 

In a forthcoming paper (Kakade et al. [2004]) we report on a series of large-scale 
computational experiments of this kind. 

4.2 Proof of Graphical Equilibrium Existence 

For reasons primarily related to Assumption II, the proof uses the interesting concept
of a “quasi-equilibrium”, originally defined by Debreu [1962] in work a decade after
his seminal existence result with Arrow. It turns out that much previous work has gone
into weakening this assumption in the AD setting. If this assumption is not present, then
Debreu [1962] shows that although true equilibria may not exist, quasi-equilibria
still exist. In a quasi-equilibrium, consumers with zero wealth are allowed to be irrational.

Our proof proceeds by establishing the existence of a quasi-equilibrium in the graphical
setting, and then showing that this in fact implies existence of a graphical equilibrium. This
last step involves a graph-theoretic argument showing that every consumer has positive
wealth.

A “graphical quasi-equilibrium” is defined as follows. 

Definition 1. A graphical quasi-equilibrium for a graphical economy is a set of globally
normalized prices $\{p^i\}$ (i.e. $\sum_{i,h} p^i_h = 1$) and a set of consumption plans $\{X^i\}$, in which
the local markets clear and for each consumer i, with wealth $w^i = p^i \cdot e^i$, the following
condition holds:

- (Rational) If consumer i has positive wealth ($w^i > 0$), then i is rational
(utility-maximizing).

- (Quasi-Rational) Else if i has no wealth ($w^i = 0$), then the plan $X^i$ is only budget
constrained (and does not necessarily maximize utility).

Lemma 1. (Graphical Quasi-Equilibria Existence) In any graphical economy in which
Assumption I holds, there exists a graphical quasi-equilibrium.

The proof is straightforward and is provided in a longer version of this paper. Note 
that if all consumers have positive wealth at a quasi-equilibrium, then all consumers are 
rational. Hence, to complete the proof of Theorem 1 it suffices to prove that all consumers 
have positive wealth at a quasi-equilibrium. For this we provide the following lemma, 
which demonstrates how wealth propagates in the graph. 

Lemma 2. If the graph of a graphical economy is connected and if Assumptions I and
II hold, then for any quasi-equilibrium set of prices $\{p^i\}$, it holds that every consumer
has non-zero wealth.



Proof. Note that by price normalization, there exists at least one consumer that has one
commodity with non-zero price. We now show that if for any consumer i, $p^i \neq 0$, then
this implies that for all $j \in N(i)$, $p^j \neq 0$. This is sufficient to prove the result, since
the graph is assumed to be connected and $e^i > 0$.

Let $\{X^i\}$ and $\{p^i\}$ be a quasi-equilibrium. Assume that for some i, $p^i \neq 0$. Since
every consumer has positive endowments in each commodity (Assumption II),
$p^i \cdot e^i > 0$, and so consumer i is rational. By Fact 1, the budget constraint inequality of i
must be saturated, so $\sum_{j \in N(i)} p^j \cdot x^{ij} = p^i \cdot e^i > 0$. Hence, there must exist a commodity
h and a $j \in N(i)$ such that $x^{ij}_h > 0$ and $p^j_h \neq 0$, else the money spent would be 0. In
other words, there must exist a commodity that is consumed by i from a neighbor at a
non-zero price.

The rationality of i implies that consumer j has the cheapest price for the commodity
h, otherwise i would buy h from a cheaper neighbor (Fact 2). More formally,
$j \in \arg\min_{\ell \in N(i)} p^\ell_h$, which implies for all $\ell \in N(i)$, $p^\ell_h \geq p^j_h > 0$. Thus we have shown
that for all $\ell \in N(i)$, $p^\ell \neq 0$, and since by Assumption II, $e^\ell > 0$, this completes the
proof. □

Without graph connectivity, it is possible that all the consumers in a disconnected
graph could have zero wealth at a quasi-equilibrium. Hence, to complete the proof of
Theorem 1, we observe that in each connected region we have a separate graphical
equilibrium.

It turns out that the “propagation” argument in the previous proof, with more careful 
accounting, actually leads to a quantitative lower bound on consumer wealth in a graph- 
ical economy, which we now present. This lower bound is particularly useful when we 
turn towards computational issues in a moment. 

The following definitions are needed: 

$$e_+ = \max_{i,h} e^i_h, \qquad e_- = \min_{i,h} e^i_h$$

Note that Assumption II implies that e_ > 0. 



Lemma 3. (Wealth Propagation) In a graphical economy, in which Assumptions I and
II hold, with a connected graph of degree $m - 1$, the wealth of any consumer i at
equilibrium prices $\{p^i\}$ is bounded as follows:

$$p^i \cdot e^i \;\geq\; (\ldots)^{\mathrm{diameter}(G)} \, \frac{\ldots}{n} \;>\; 0$$

The proof is provided in the long version of this paper. 

Interestingly, note that a graph that maximizes free trade (i.e., a fully connected graph)
maximizes this lower bound on the wealth of a consumer.




5 Algorithms for Computing Economic Equilibria 

All of our algorithmic results compute approximate, rather than exact, economic
equilibria. We first give the requisite definitions. We use the natural definition originally
presented in Deng et al. [2002]. First, two concepts are useful to define: approximate
optimality and approximate clearance. A plan is $\epsilon$-optimal at some price P if the plan
is budget constrained under P and if the utility of the plan is at least $1 - \epsilon$ times the
optimal utility under P. The market $\epsilon$-clears if, in the standard setting, $(1-\epsilon)e \leq d \leq e$
and, in the graphical setting, for all i, $(1-\epsilon)e^i \leq d^i \leq e^i$. Now we say a set of
plans and prices constitute an $\epsilon$-equilibrium if the market $\epsilon$-clears and if the plans are
$\epsilon$-optimal. ^
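
As a small illustration of the two definitions, the checks can be written as predicates.
This is a sketch only: the function names are hypothetical, and the optimal utility and
the cost of the plan are assumed to be supplied by the caller rather than computed.

```python
def eps_clears(demand, endowment, eps):
    """True if (1 - eps) * e_h <= d_h <= e_h holds for every good h."""
    return all((1 - eps) * e <= d <= e for d, e in zip(demand, endowment))

def eps_optimal(plan_utility, optimal_utility, cost, wealth, eps):
    """True if the plan is affordable and its utility is at least
    (1 - eps) times the optimal utility at the same prices."""
    return cost <= wealth and plan_utility >= (1 - eps) * optimal_utility

print(eps_clears((0.95, 1.0), (1.0, 1.0), eps=0.1))
print(eps_optimal(9.5, 10.0, cost=4.0, wealth=4.0, eps=0.1))
```

In the graphical setting the same clearance check would be applied separately to each
consumer's local market.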

The algorithms we present search for an approximate ADE on a discretized grid.
Hence, we need some sort of “smoothness” condition on the utility function in order for
the discretized grid to be a good approximation to the true space. More formally,

Assumption III: We assume there exists $\gamma > 0$ such that for all i and for all x

$$u_i((1+\gamma)x) \leq \exp(\gamma d)\, u_i(x)$$

for some constant d.

Note that for polynomials with positive weights, the constant d can be taken to be the
degree of the polynomial. Essentially, the condition states that if a consumer increases
his consumption plan by a multiplicative factor of $1+\gamma$, then his utility cannot increase by
more than the exponentially larger multiplicative factor of $\exp(\gamma d)$. This condition is a
natural one to consider, since the “growth rate” constant d is dimensionless (unlike the
derivative of the utility function $\partial u_i/\partial x$, which has units of utility/goods).
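
The polynomial case can be checked numerically. The sketch below is illustrative only and
restricts attention to a single good; the helper names and the sample polynomial are
assumptions. The reason the bound holds is that each monomial of degree $a \leq d$ scales
by $(1+\gamma)^a \leq (1+\gamma)^d \leq \exp(\gamma d)$.

```python
import math

def poly_utility(coeffs, x):
    """u(x) = sum_a coeffs[a] * x**a, with non-negative coefficients."""
    return sum(c * x ** a for a, c in enumerate(coeffs))

def satisfies_growth_bound(u, gamma, d, xs, tol=1e-12):
    """Check u((1 + gamma) x) <= exp(gamma * d) * u(x) at the sample points xs."""
    bound = math.exp(gamma * d)
    return all(u((1 + gamma) * x) <= bound * u(x) * (1 + tol) for x in xs)

def cubic(x):
    # Hypothetical utility of degree 3 with positive weights.
    return poly_utility([0.0, 2.0, 0.5, 1.0], x)

samples = [0.1, 1.0, 10.0, 100.0]
print(satisfies_growth_bound(cubic, gamma=0.1, d=3, xs=samples))
```

Taking d smaller than the degree can fail: for $u(x) = x^3$ and $\gamma = 0.1$, the ratio
$u(1.1)/u(1) = 1.331$ exceeds $\exp(0.1) \approx 1.105$.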

Naturally, for reasons of computational generality, we make a “black box” represen- 
tational assumption on the utility functions. 

Assumption IV: We assume that for all i, the utility function $u_i$ is given as an oracle,
which given an input $x^i$ outputs $u_i(x^i)$ in unit time.

For the remainder of the paper, we assume that Assumptions I-IV hold.

^ It turns out that any $\epsilon$-approximate equilibrium in our setting with monotonically increasing
utility functions can be transformed into an approximate equilibrium in which the market exactly
clears while the plans are still $\epsilon$-optimal. To see this, note that the cost of the unsold goods is
equal to the surplus money in the consumers’ budgets. The monotonicity assumption allows
us to increase the consumption plans, using the surplus money, to take up the excess supply
without decreasing utilities. This transformation is in general not possible if we weaken the
monotonicity assumption to a non-satiability assumption.







5.1 An Improved Algorithm for Computing AD Equilibria 

We now present an algorithm for computing AD equilibria for rather general utility 
functions in the non-graphical setting. The algorithm is a generalization of the algorithm 
provided by Deng et al. [2002], which computes equilibria for the case in which the 
utilities are linear functions. While our primary interest in this algorithm is as a subroutine 
for the graphical algorithm presented in Section 5.3, it is also of independent interest. 

The idea of the algorithm is as follows. For each consumer i, a binary valued
“best-response” table $M_i(p, x)$ is computed, where the indices p and x are prices and plans.
The value of $M_i(p, x)$ is set to 1 if and only if x is $\epsilon$-optimal for consumer i at prices
p. Once these tables are computed, the “price player’s” task is then to find p and $\{x^i\}$
such that $(1-\epsilon)e \leq d \leq e$ and for all i, $M_i(p, x^i) = 1$.

To keep the tables $M_i$ of finite size, we only consider prices and plans on a grid. As
in Deng et al. [2002] and Papadimitriou and Yannakakis [2000], we consider a relative
grid of the form:

$$\mathcal{G}_{\text{price}} = \{p_0, (1+\epsilon)p_0, (1+\epsilon)^2 p_0, \ldots, 1\},$$
$$\mathcal{G}_{\text{plan}} = \{x_0, (1+\epsilon)x_0, (1+\epsilon)^2 x_0, \ldots, n e_+\},$$

where the maximal grid price is 1 and maximal grid plan is ne+ (since there is at most 
an amount ne+ of any good in the market). The intuitive reason for the use of a relative 
grid is that demand is more sensitive to price perturbations of cheaper priced goods, 
since consumers have more purchasing power for these goods. 
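
A relative (geometric) grid of this kind is easy to construct. The following sketch is
illustrative; the function name and the sample parameters are assumptions, and the lowest
grid value is simply passed in rather than derived from a wealth bound.

```python
def relative_grid(lowest, highest, eps):
    """Return [lowest, (1+eps)*lowest, (1+eps)**2*lowest, ..., highest]."""
    grid = []
    value = lowest
    while value < highest:
        grid.append(value)
        value *= 1 + eps
    grid.append(highest)   # cap the grid at its maximal value
    return grid

price_grid = relative_grid(lowest=0.01, highest=1.0, eps=0.1)
print(len(price_grid), price_grid[0], price_grid[-1])
```

Consecutive grid points differ by a factor of at most $1+\epsilon$, so the number of points
grows roughly like $\log(\text{highest}/\text{lowest}) / \log(1+\epsilon)$, which is consistent with the
running times below depending on $1/\epsilon$ and on a logarithm of the range.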

In Section 5.2, we sketch the necessary approximation scheme, which shows how to
set $p_0$ and $x_0$ such that an $\epsilon$-equilibrium on this grid exists. The natural method to set $p_0$
is to use a lower bound on the equilibrium prices. Unfortunately, under rather general
conditions, only the trivial lower bound of 0 is possible. However, we can set $p_0$ and $x_0$
based on a non-trivial wealth bound.

Now let us sketch how we use the tables to compute an $\epsilon$-equilibrium. Essentially, the
task now lies in checking that the demand vector d is close to e for a set of plans and prices
which are true for the $M_i$. As in Deng et al. [2002], a form of dynamic programming
suffices. Consider a binary, “partial sum of demand” table $S_i(p, d)$ defined as follows:
$S_i(p, d) = 1$ if and only if there exist $x^1, \ldots, x^i$ such that $d = x^1 + x^2 + \ldots + x^i$
and $M_1(p, x^1) = 1, \ldots, M_i(p, x^i) = 1$. These tables can be computed recursively as
follows: if $S_{i-1}(p, d) = 1$ and if $M_i(p, x) = 1$, then we set $S_i(p, x + d) = 1$. Further,
we keep track of a “witness” $x^1, \ldots, x^i$ which proves that the table entry is 1. The
approximation lemmas in Section 5.2 show how to keep this table of finite “small” size
(see also long version of the paper).

Once we have $S_n$, we just search for some index p and d such that $S_n(p, d) = 1$
and $d \approx e$. This p and the corresponding witness plans then constitute an equilibrium.
The time complexity of this algorithm is polynomial in the table sizes, which we shall
see are of polynomial size for a fixed k. This gives rise to the following theorem.
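
The recursion over partial sums of demand can be sketched as follows for a single fixed
price p. This is an illustration under simplifying assumptions, not the paper's
implementation: the best-response sets are given explicitly as lists of plans, demands
are summed exactly, and the rounding that keeps the table small is omitted, so the
table here can grow quickly. All names are hypothetical.

```python
def partial_sum_table(best_responses):
    """best_responses[i] lists the plans x (tuples, one entry per good) that are
    acceptable for consumer i at the fixed price. Returns a dict mapping each
    reachable total demand d to a witness list of one plan per consumer."""
    k = len(best_responses[0][0])        # number of goods
    table = {(0,) * k: []}               # before any consumer: zero demand
    for plans in best_responses:         # extend the table by one consumer
        extended = {}
        for d, witness in table.items():
            for x in plans:
                total = tuple(a + b for a, b in zip(d, x))
                extended.setdefault(total, witness + [x])
        table = extended
    return table

def find_clearing(table, endowment, eps):
    """Return (d, witness) with (1 - eps) * e <= d <= e componentwise, or None."""
    for d, witness in table.items():
        if all((1 - eps) * e <= dh <= e for dh, e in zip(d, endowment)):
            return d, witness
    return None

responses = [[(1, 0), (0, 1)], [(1, 1), (0, 2)]]
table = partial_sum_table(responses)
print(find_clearing(table, endowment=(1, 2), eps=0.0))
```

The full algorithm would repeat this search for every price on the price grid and report
the first price at which a clearing demand is found.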

Theorem 2. For fixed k, there exists an algorithm which takes as input an AD economy
and outputs an $\epsilon$-equilibrium in time polynomial in n, $1/\epsilon$, $\log(e_+/e_-)$, and d.

The approximation details and proof are provided in the long version of this paper. 

5.2 Approximate Equilibria on a Relative Grid

We now describe a relative discretization scheme for prices and consumption plans that is 
used by the algorithm just described for computing equilibria in classical (non-graphical) 
AD economies. This scheme can be generalized for the graphical setting, but is easier 
to understand in the standard setting. 

Without loss of generality, throughout this section we assume the prices in a market
are globally normalized, i.e. $\sum_h p_h = 1$.

A price and consumption plan can be mapped onto the relative grid in the obvious
way. Define $\mathrm{grid}(p) \in \mathbb{R}^k$ to be the closest price to p such that each component of
$\mathrm{grid}(p)$ is on the price grid. Hence,

$$\frac{1}{1+\epsilon}\, p \;\leq\; \mathrm{grid}(p) \;\leq\; \max\{(1+\epsilon)p,\; p_0 \mathbf{1}\}$$

where the max is taken component-wise and $\mathbf{1}$ is a k-length vector of all ones. Note that
the value of $p_0$ is a threshold where all prices below $p_0$ get set to this threshold price.
Similarly, for any consumption plan $x^i$, let $\mathrm{grid}(x^i)$ be the closest plan to $x^i$ such
that $\mathrm{grid}(x^i)$ is componentwise on the plan grid $\mathcal{G}_{\text{plan}}$.
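
The mapping onto the grid, together with the sandwich bound on each price component, can
be sketched as below. The helper names are hypothetical, "closest" is taken to mean
closest in absolute difference (an assumption), and the sample grid uses $p_0 = 0.01$ and
$\epsilon = 0.1$ purely for illustration.

```python
import bisect

def snap(value, grid):
    """Closest point of a sorted grid to value (ties go to the lower point)."""
    pos = bisect.bisect_left(grid, value)
    if pos == 0:
        return grid[0]            # values below the lowest point are raised to it
    if pos == len(grid):
        return grid[-1]
    below, above = grid[pos - 1], grid[pos]
    return below if value - below <= above - value else above

def grid_map(vector, grid):
    """Apply snap component-wise to a price or plan vector."""
    return [snap(v, grid) for v in vector]

eps, p0 = 0.1, 0.01
price_grid = [p0 * (1 + eps) ** k for k in range(49)] + [1.0]
for p in (0.001, 0.0123, 0.5, 0.99):
    g = snap(p, price_grid)
    # Sandwich bound: p / (1 + eps) <= grid(p) <= max((1 + eps) * p, p0)
    print(p, g, p / (1 + eps) <= g <= max((1 + eps) * p, p0) + 1e-12)
```

The bound holds because neighboring grid points differ by a factor of at most
$1+\epsilon$, and anything below $p_0$ is mapped to $p_0$ itself.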

In order for such a discretization scheme to work, we require two properties. First, 
the grid should certainly contain an approximate equilibrium of the desired accuracy. We 
shall refer to this property as Approximate Completeness (of the grid). Second, and more
subtly, it should also be the case that maximizing consumer utility, while constrained to 
the grid, results in utilities close to those achieved by the unconstrained maximization 
— otherwise, our grid-restricted search for equilibria might result in highly suboptimal 
consumer plans. We shall refer to this property as Approximate Soundness (of the grid). 
It turns out that Approximate Soundness only holds if prices ensure a minimum level of 
wealth for each consumer, but conveniently we shall always be in such a situation due 
to Lemma 3. 

The next two lemmas establish Approximate Completeness and Soundness for the
grid. The Approximate Completeness Lemma also states how to set $p_0$ and $x_0$. It is
straightforward to show that if we have a lower bound on the price at equilibrium, then
$p_0$ can be set to this lower bound. Unfortunately, it turns out that under our rather general
conditions we cannot provide a lower bound. Instead, as the lemmas show, it is sufficient
to use a lower bound $w_0$ on the wealth of any consumer at equilibrium, and set $p_0$ and
$x_0$ based on this wealth. Note that in the traditional AD setting $e_-$ is a bound on the
wealth, since the prices are normalized.

Lemma 4. (Approximate Completeness) Let the grids $\mathcal{G}_{\text{price}}$ and $\mathcal{G}_{\text{plan}}$ be defined
using

= nt = (1 + L)nk^^ 

where $w_0$ is a lower bound on equilibrium wealth of all consumers and let $\{x^{*i}\}$ and
$p^*$ be equilibrium plans and prices. Then the plans $\{x^i = \mathrm{grid}(\ldots)\}$
are $19d\epsilon$-approximately optimal for the price $p = \mathrm{grid}(p^*)$ and the market
$14\epsilon$-approximately clears. Furthermore, a useful property of this approximate equilibrium
is that every consumer has wealth greater than $\ldots$

There are a number of important subtleties to be addressed in the proof, which we
formally present in the longer version. For instance, note that the closest point on the
grid to some true equilibrium may not even be budget constrained.

Lemma 5. (Approximate Soundness) Let the grid be defined as in Lemma 4 and let
p be on the grid such that every consumer has wealth above $\ldots$. If the plans $\{x^i\}$
$\beta$-approximately maximize utility over the budget constrained plans which are
componentwise on the grid, i.e. if for all budget constrained $x'^i$ which lie on the plan grid,

$$u_i(x^i) \geq (1-\beta)\, u_i(x'^i),$$

then

$$u_i(x^i) \geq (1 - (\beta + 4\epsilon d))\, u_i^*$$

where $u_i^*$ is the optimal utility under p.

5.3 Arrow-Debreu Propagation for Graphical Equilibria 

We now turn to the problem of computing equilibria in graphical economies. We present
the ADProp algorithm, which is a dynamic programming, message-passing algorithm
for computing approximate graphical equilibria when the graph has a tree structure.
Recall that in a graphical economy there are effectively nk goods, so we cannot keep 
the number of goods fixed as we scale the number of consumers. Hence, the algorithm 
described in the previous section cannot be directly applied if we wish to scale polyno- 
mially with the number of consumers. 

As we will see from the description of ADProp below, an appealing conceptual 
property of the algorithm is how it achieves the computation of global economic equi- 
libria in a distributed manner through the local exchange of economic trade and price 
information between just the neighbors in the graph. 

We orient the graph such that “downstream” from a vertex lies the root and “upstream”
lie the leaves. For any consumer j that is not the root there exists a unique
downstream consumer, say $\ell$. Let UP(j) be the set of neighbors of j which are not
downstream, i.e. UP(j) is the set $N(j) \setminus \{\ell\}$, so it includes j itself.
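
The orientation step can be sketched in a few lines. This is an illustrative helper, not
the ADProp algorithm itself: the function name and the adjacency-dict input format are
assumptions, and the adjacency lists are assumed to contain only the other consumers, so
j is added to UP(j) explicitly.

```python
def orient_tree(adjacency, root):
    """Return (downstream, up, order) for a tree given as an adjacency dict.
    downstream[j] is the unique neighbor of j on the path to the root (None for
    the root); up[j] is j together with its non-downstream neighbors; order
    lists the consumers so that each one appears before its downstream neighbor."""
    downstream = {root: None}
    stack, visited = [root], []
    while stack:
        j = stack.pop()
        visited.append(j)
        for i in adjacency[j]:
            if i not in downstream:
                downstream[i] = j
                stack.append(i)
    up = {j: [j] + [i for i in adjacency[j] if i != downstream[j]]
          for j in adjacency}
    # Reversing a root-first traversal puts every consumer before its
    # downstream neighbor, which is the order a leaves-to-root pass needs.
    return downstream, up, visited[::-1]

tree = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}
downstream, up, order = orient_tree(tree, root=1)
print(downstream, up, order)
```

A message-passing pass toward the root would process consumers in `order`, so that each
consumer has heard from everyone in its UP set before it sends its own table.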

We now define a binary valued table $T_{\ell j}$, which can be viewed as the message
that consumer $j \in UP(\ell)$ sends downstream to $\ell$. The table $T_{\ell j}(p^j, x^{j\ell}, p^\ell, x^{\ell j})$ is
indexed by the prices for $\ell$ and j and the consumption that flows along the edge between
$\ell$ and j: from $\ell$ to j, the consumption is $x^{j\ell}$ and from j to $\ell$, the consumption is
$x^{\ell j}$. The table entry $T_{\ell j}(p^j, x^{j\ell}, p^\ell, x^{\ell j})$ evaluates to 1 if and only if there exists a
conditional $\epsilon$-equilibrium upstream from j (inclusive) in which the respective prices and
plans are fixed to $p^j, x^{j\ell}, p^\ell, x^{\ell j}$. For the special case where $j = \ell$, the table entry
$T_{\ell\ell}(p^j, x^{j\ell}, p^\ell, x^{\ell j})$ is set to 1 if and only if $p^j = p^\ell$ and $x^{j\ell} = x^{\ell j}$ (note that
$x^{\ell\ell}$ is effectively the amount of the goods that j desires not to sell).

The tables provide all the information needed to apply dynamic programming in
the obvious way. In its downstream pass, ADProp computes the table $T_{\ell j}$ recursively,
in the typical dynamic programming fashion. If j is an internal node in the
tree, when j has received the appropriate tables from all $i \in UP(j)$, we must set
$T_{\ell j}(p^j, x^{j\ell}, p^\ell, x^{\ell j}) = 1$ if: 1) a conditional upstream equilibrium exists, which we
can compute from the tables passed to j, 2) the plan $X^j$, consistent with the upstream
equilibrium, is $\epsilon$-optimal for the neighborhood prices, and 3) the market $\epsilon$-clears at j.
Naturally, a special but similar operation occurs at the leaves and the root of the tree.

Once ADProp computes the message at the root consumer, it performs an upstream 
pass to obtain a single graphical equilibrium, again, in the typical dynamic programming 
fashion. At every node, starting with the root, ADProp selects price and allocation 
assignments consistent with the tables at the node and passes those assignments up to 
their upstream neighbors, until it reaches the leaves of the tree. 

As presented in Section 5.2, we can control the approximation error by using appro- 
priate sized grids. This leads to our main theorem for computing graphical equilibrium. 

Theorem 3. (ADProp) For fixed k and graph degree, ADProp takes as input a tree
graphical economy in which Assumptions I-IV hold and outputs an $\epsilon$-equilibrium in
time polynomial in n, $1/\epsilon$, $\log(e_+/e_-)$, and d.

Heuristic generalizations of ADProp are possible to handle more complex (loopy) 
graph structures (a la NashProp [Ortiz and Kearns, 2003]). 



6 Learning in Graphical Games and Economics 

Although the work described here has focused primarily on the graphical economics 
representation, and algorithms for equilibrium computation, the general area of graphical 
models for economic and strategic settings is rich with challenging learning problems 
and issues. We conclude by mentioning just a few of these. 

Rational Learning in Graphical Games. What happens if each player in a repeated 
graphical game plays according to some “rational” dynamics (like fictitious play, best 
response, or other variants), but using only local observations (the actions of neighbors)? 
In cases where convergence occurs, how does the graph structure influence the equilib- 
rium chosen? Are there particular topological properties that favor certain players in the 
network? 

No-Regret Learning in Graphical Games. It has recently been established that if all
players in a repeated graphical game play a local no-internal-regret algorithm, the
population empirical play will converge to the set of correlated equilibria. It was also noted
in the introduction that all such equilibria can be represented up to payoff equivalence
on a related Markov network; under what conditions will no-regret learning dynamics
actually settle on one of these succinct equilibria? In preliminary experiments using the
algorithms of Foster and Vohra [1999] as well as those of Hart and Mas-Colell [2000]
and Hart and Mas-Colell [2001], one does not observe convergence to the set of
payoff-equivalent Markov network correlated equilibria.

Learning in Traditional AD Economies. Even in the non-graphical Arrow-Debreu 
setting, little is known about reasonable distributed learning procedures. Aside from a 
strong (impossibility) result by Saari and Simon [1978] suggesting that general conver-
gence results may not be possible, there is considerable open territory here. Conceptual 
challenges include the manner in which the “price player” should be modeled in the 
learning process. 

Learning in Graphical Economics. Finally, problems of learning in the graphical
economics model are entirely open, including the analogues to all of the questions above.
Generally speaking, one would like to formulate reasonable procedures for local learning
(adjustment of seller prices and buyer purchasing decisions), and examine how these
procedures are influenced by network structure.

Abstract. We provide a natural learning process in which the joint frequency
of empirical play converges into the set of convex combinations
of Nash equilibria. In this process, all players rationally choose their actions
using a public prediction made by a deterministic, weakly calibrated
algorithm. Furthermore, the public predictions used in any given round
of play are frequently close to some Nash equilibrium of the game.



1 Introduction 

Perhaps the most central question for justifying any game theoretic equilibrium
as a general solution concept is: can we view the equilibrium as a convergent
point of a sensible learning process? Unfortunately for Nash equilibria, there
are currently no learning algorithms in the literature in which play generally
converges (in some sense) to a Nash equilibrium of the one shot game, short of
exhaustive search; see Foster and Young [ming] for perhaps the most general
result in which players sensibly search through hypotheses. In contrast, there is a
long list of special cases (e.g., zero-sum games, 2×2 games, assumptions about the
players’ prior subjective beliefs) in which there exist learning algorithms that
have been shown to converge (a representative but far from exhaustive list would
be Robinson [1951], Milgrom and Roberts [1991], Kalai and Lehrer [1993],
Fudenberg and Levine [1998], Freund and Schapire [1999]).

If we desire that the mixed strategies themselves converge to a Nash equilibrium,
then a recent result by Hart and Mas-Colell [2003] shows that this
is, in general, not possible under a certain class of learning rules.^1 Instead,
one can examine the convergence of the joint frequency of the empirical play,
which has the advantage of being an observable quantity. This has worked
well in the case of a similar equilibrium concept, namely correlated equilibrium
(Foster and Vohra [1997], Hart and Mas-Colell [2000]). However, for Nash
equilibria, previous general results even for this weaker form of convergence are
limited to some form of exhaustive search (though see Foster and Young [ming]).

In this paper, we provide a learning process in which the joint frequency of 
empirical play converges to a Nash equilibrium, if it is unique. More generally, 
convergence is into the set of convex combinations of Nash equilibria (where 

^1 They show that, in general, there exist no continuous time dynamics which converge
to a Nash equilibrium (even if the equilibrium is unique), with the natural restriction
that a player’s mixed strategy is updated without using the knowledge of the other
players’ utility functions.



J. Shawe-Taylor and Y. Singer (Eds.): COLT 2004, LNAI 3120, pp. 33—48, 2004. 
the empirical play could jump from one Nash equilibrium to another infinitely
often). Our learning process is the most traditional one: players make predictions
of their opponents and take best responses to their predictions. Central to our
learning process is the use of public predictions formed by an “accurate” (e.g.,
calibrated) prediction algorithm.

We now outline the main contributions of this paper. 

“Almost” Deterministic Calibration. Formulating sensible prediction algorithms
is a notoriously difficult task in the game theoretic setting.^2 A rather
minimal requirement for any prediction algorithm is that it should be calibrated
(see Dawid [1982]). An informal explanation of calibration would go something
like this. Suppose each day a weather forecaster makes some prediction, say p,
of the chance that it rains the next day. Now from the subsequence of days
on which the forecaster announced p, compute the empirical frequency that it
actually rained the next day, and call this $\rho(p)$. Crudely speaking, calibration
requires that $\rho(p)$ equal p, if the forecast p is used often.
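
The informal definition can be made concrete with a short sketch. The functions below are
illustrative only: the names are hypothetical, and the particular error summary (a
count-weighted average of the gap between empirical frequency and forecast) is one
reasonable choice made for this example, not the formal definition used later.

```python
from collections import defaultdict

def empirical_frequencies(forecasts, outcomes):
    """Map each announced forecast p to (frequency, count): the fraction of the
    days on which p was announced that were followed by rain, and how often
    p was announced."""
    rain = defaultdict(int)
    count = defaultdict(int)
    for p, rained in zip(forecasts, outcomes):
        count[p] += 1
        rain[p] += int(rained)
    return {p: (rain[p] / count[p], count[p]) for p in count}

def calibration_error(forecasts, outcomes):
    """Count-weighted average of |frequency - p| over the forecasts used."""
    freq = empirical_frequencies(forecasts, outcomes)
    return sum(n * abs(f - p) for p, (f, n) in freq.items()) / len(forecasts)

forecasts = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0]
outcomes = [1, 0, 1, 0, 1, 1]
print(empirical_frequencies(forecasts, outcomes))
print(calibration_error(forecasts, outcomes))
```

Weighting by the count reflects the qualifier "if the forecast p is used often": rarely
used forecasts contribute little to the error.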

If the weather acts adversarially, then Oakes [1985] and Dawid [1985] show
that a deterministic forecasting algorithm will not always be calibrated. However,
Foster and Vohra [1998] show that calibration is almost surely guaranteed
with a randomized forecasting rule, i.e., where the forecasts are chosen using private
randomization and the forecasts are hidden from the weather until the
weather makes its decision to rain or not. Of course, this solution makes it difficult
for a weather forecaster to publicly announce a prediction.

Although stronger notions of calibration have been proposed (see
Kalai et al. [1999]), here we actually consider a weaker notion.^3 Our contribution
is to provide a deterministic algorithm that is always weakly calibrated.
Rather than precisely defining weak calibration here, we continue with our example
to show how this deterministic algorithm can be used to obtain calibrated
forecasts in the standard sense.

Assume the weather forecaster uses our deterministic algorithm and publicly 
announces forecasts to a number of observers interested in the weather. Say the 
following forecasts are made over some period of 5 days: 

0.8606, 0.2387, 0.57513, 0.4005, 0.069632, ... 

How can an interested observer make calibrated predictions using this announced 
forecast? In our setting, an observer can just randomly round the forecasts in or- 
der to calibrate. For example, if the observer rounds to the second digit, then on 
the first day, the observer will privately predict .87 with probability .06 and .86 
otherwise, and, on the second day, the private predictions will be 0.24 with prob- 
ability 0.87 and 0.23 otherwise. Under this scheme, the asymptotic calibration 

^2 Subjective notions of probability fall prey to a host of impossibility results: crudely,
Alice wants to predict Bob while Bob wants to predict Alice, which leads to a
feedback loop (if Alice and Bob are both rational). See Foster and Young [2001].

^3 We use the word “weak” in the technical sense of weak convergence of measures (see
Billingsley [1968]) rather than how it is used by Kalai et al. [1999].
error of the observer will, almost surely, be small (and if the observer rounded
to the third digit, this error would be even smaller).
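
The randomized rounding in this example can be sketched as follows. The function names
are hypothetical; the rule itself is the one described above, in which a forecast is
rounded up with probability equal to its fractional position between the two neighboring
grid values, so that the private prediction equals the public forecast in expectation.

```python
import math
import random

def rounding_distribution(forecast, digits=2):
    """Return (low, high, prob_high): predict `high` with probability prob_high
    and `low` otherwise."""
    scale = 10 ** digits
    floor = math.floor(forecast * scale)
    prob_high = forecast * scale - floor
    return floor / scale, (floor + 1) / scale, prob_high

def randomized_round(forecast, digits=2, rng=random):
    """Draw one private prediction from the rounding distribution."""
    low, high, prob_high = rounding_distribution(forecast, digits)
    return high if rng.random() < prob_high else low

for public in (0.8606, 0.2387):
    print(public, rounding_distribution(public))
```

For the first two announced forecasts this gives 0.87 with probability about 0.06 and 0.24
with probability about 0.87, matching the numbers in the text.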

Unlike previous calibrated algorithms, this deterministic algorithm provides 
a meaningful forecast, which can be calibrated using only randomized rounding. 



Nash Convergence. The existence of a deterministic forecasting scheme leaves 
open the possibility that all players can rationally use some public forecast, 
since each player is guaranteed to form calibrated predictions (regardless of how
the other players behave). For example, say some public forecaster provides a 
prediction of the full joint distribution of all n players. The algorithm discussed 
above can be generalized such that each player can use this prediction (with 
randomized rounding) to construct a prediction of the other players. Each player 
can then use their own prediction to choose a best response. 

We formalize this scheme later, but point out that our (weakly) calibrated
forecasting algorithm only needs to observe the history of play (and does not
require any information about the players’ utility functions). Furthermore, there
need not be any “publicly announced” forecast provided to every player at each
round; alternatively, each player could have knowledge of the deterministic
forecasting algorithm and could perform the computation themselves.

Now Foster and Vohra [1997] showed that if players make predictions that 
satisfy the rather minimal calibration condition, then the joint frequency of the 
empirical play converges into the set of correlated equilibria. Hence, it is im- 
mediate that in our setting, convergence is into the set of correlated equilibria. 
However, we can prove the stronger condition that the joint frequency of em- 
pirical play converges into the set of convex combinations of Nash equilibria, 
a smaller set than that of correlated equilibria. This directly implies that the 
average payoff achieved by each player is at least the player’s payoff under some 
Nash equilibrium — a stronger guarantee than achieving a (possibly smaller) 
correlated equilibrium payoff. 

This setting deals with the coordination problem of “which Nash equilibrium 
to play?” in a natural manner. The setting does not arbitrarily force play to any 
single equilibrium and allows the possibility that players could (jointly) switch 
play from one Nash equilibrium to another — perhaps infinitely often. Further- 
more, although play converges to the convex combinations of Nash equilibria, 
we have the stronger result that the public forecasts themselves are frequently 
close to some Nash equilibrium (not general combinations of them). Of course if
the Nash equilibrium is unique, then the empirical play converges to it. 

The convergence rate, until the empirical play is an approximate Nash equilibrium,
is $O(\sqrt{T})$ (where $T$ is the number of rounds of play), with constants
that are exponential in both the number of players and actions. Hence, our set- 
ting does not lead to a polynomial time algorithm for computing an approximate 
Nash equilibrium (which is currently an important open problem). 







S.M. Kakade and D.P. Foster 



2 Deterministic Calibration 

We first describe the online prediction setting. There is a finite outcome space
$\Omega = \{1, 2, \ldots, |\Omega|\}$. Let $X$ be an infinite sequence of outcomes, whose $t$-th
element, $X_t$, indicates the outcome at time $t$. For convenience, we represent
the outcome $X_t = (X_t[1], X_t[2], \ldots, X_t[|\Omega|])$ as a binary vector in $\{0,1\}^{|\Omega|}$ that
indicates which state at time $t$ was realized: if the realized state was $i$, then the
$i$-th component of $X_t$ is 1 and all other components are 0. Hence, $\frac{1}{T}\sum_{t=1}^{T} X_t$ is
the empirical frequency of the outcomes up to time $T$ and is a valid probability
distribution.

A forecasting method, $F$, is simply a function from a sequence of outcomes to
a probability distribution over $\Omega$. The forecast that $F$ makes at time $t$ is denoted
by $f_t = F(X_1, X_2, \ldots, X_{t-1})$ (clearly, the $t$-th forecast must be made without
knowledge of $X_t$). Here $f_t = (f_t[1], f_t[2], \ldots, f_t[|\Omega|])$, where the $i$-th component is
the forecasted probability that state $i$ will be realized at time $t$.



2.1 Weak Calibration 



We now define a quantity to determine if $F$ is calibrated with respect to some
probability distribution $p$. Define $I_{p,\epsilon}(f)$ to be a “test” function indicating if the
forecast $f$ is $\epsilon$-close to $p$, ie

$$ I_{p,\epsilon}(f) = \begin{cases} 1 & \text{if } |f - p| \le \epsilon \\ 0 & \text{else} \end{cases} $$

where $|f|$ denotes the $\ell_1$ norm, ie $|f| = \sum_{k \in \Omega} |f[k]|$. We define the calibration
error $\mu_T$ of $F$ with respect to $I_{p,\epsilon}$ as:

$$ \mu_T(I_{p,\epsilon}, X, F) = \frac{1}{T} \sum_{t=1}^{T} I_{p,\epsilon}(f_t)\,(X_t - f_t) $$

Note that $X_t - f_t$ is the immediate error (which is a vector) and the above
error $\mu_T$ measures this instantaneous error on those times when the forecast was
$\epsilon$-close to $p$.

We say that $F$ is calibrated if for all sequences $X$ and all test functions $I_{p,\epsilon}$,
the calibration error tends to 0, ie

$$ \mu_T(I_{p,\epsilon}, X, F) \to 0 $$

as $T$ tends to infinity. As discussed in the Introduction, there exist no deterministic
rules $F$ that are calibrated (Dawid [1985], Oakes [1985]). However,
Foster and Vohra [1998] show that there exist randomized forecasting rules $F$
(ie $F$ is a randomized function) which are calibrated. Namely, there exists a
randomized $F$ such that for all sequences $X$ and for all test functions $I_{p,\epsilon}$, the
error $\mu_T(I_{p,\epsilon}, X, F) \to 0$ as $T$ tends to infinity, with probability 1 (where the
probability is taken with respect to the randomization used by the forecasting
scheme).




Deterministic Calibration and Nash Equilibrium 






We now generalize this definition of the calibration error by defining it with 
respect to arbitrary test functions w, where a test function is defined as a map- 
ping from probability distributions into the interval [0, 1]. We define the calibra- 
tion error of F with respect to the test function w as: 

$$ \mu_T(w, X, F) = \frac{1}{T} \sum_{t=1}^{T} w(f_t)\,(X_t - f_t) $$

This is consistent with the previous definition if we set $w = I_{p,\epsilon}$.
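
As an illustration only, the definition of $\mu_T(w, X, F)$ can be evaluated numerically along the following lines. This is a minimal sketch, assuming outcomes are supplied as one-hot rows and forecasts as probability rows; the names `calibration_error` and `indicator_test` are hypothetical and do not come from the text.

```python
import numpy as np

def calibration_error(w, outcomes, forecasts):
    """mu_T(w, X, F) = (1/T) * sum_t w(f_t) * (X_t - f_t), a vector in R^{|Omega|}."""
    outcomes = np.asarray(outcomes, dtype=float)    # shape (T, |Omega|), one-hot rows X_t
    forecasts = np.asarray(forecasts, dtype=float)  # shape (T, |Omega|), forecast rows f_t
    weights = np.array([w(f) for f in forecasts])   # test function evaluated at each forecast
    return (weights[:, None] * (outcomes - forecasts)).mean(axis=0)

def indicator_test(p, eps):
    """The test function I_{p,eps}: 1 if the forecast is eps-close to p in l1 norm, else 0."""
    p = np.asarray(p, dtype=float)
    return lambda f: 1.0 if np.abs(f - p).sum() <= eps else 0.0

# Alternating outcomes with a constant forecast of (0.5, 0.5).
X = [[1, 0], [0, 1]] * 50
f = [[0.5, 0.5]] * 100
err = calibration_error(indicator_test([0.5, 0.5], 0.1), X, f)
```

On this toy sequence the forecast matches the empirical frequency, so the error vector is zero; a constant forecast of (0.9, 0.1) on the same outcomes would instead leave an error of (-0.4, 0.4) under the constant test function.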

Let $W$ be the set of all test functions which are Lipschitz continuous functions.
We say that $F$ is weakly calibrated if for all sequences $X$ and all $w \in W$,

$$ \mu_T(w, X, F) \to 0 $$

as $T$ tends to infinity. Also, we say that $F$ is uniformly, weakly calibrated if for
all $w \in W$,

$$ \sup_X \mu_T(w, X, F) \to 0 $$

as $T$ tends to infinity. The latter condition is strictly stronger. Our first main
result follows.

Theorem 1. (Deterministic Calibration) There exists a deterministic forecast- 
ing rule which is uniformly, weakly calibrated. 

The proof of this theorem is constructive and is presented in section 4. 



2.2 Randomized Rounding for Standard Calibration 

We now show how to achieve calibration in the standard sense (with respect to
the indicator functions $I_{p,\epsilon}$), using a deterministic weakly calibrated algorithm
along with some randomized rounding. Essentially, the algorithm rounds any
forecast to some element in a finite set, V, of forecasts. In the example in the 
Introduction, the set V was the set of probability distributions which are specified 
up to the second digit of precision. 

Let $\Delta$ be the simplex in which the forecasts live ($\Delta \subset \mathbb{R}^{|\Omega|}$). Consider some
triangulation of $\Delta$. By this, we mean that $\Delta$ is partitioned into a set of simplices
such that any two simplices intersect in either a common face, common vertex,
or not at all. Let $V$ be the vertex set of this triangulation. Note that any point
$p$ lies in some simplex in this triangulation, and, slightly abusing notation, let
$V(p)$ be the set of corners for this simplex. Informally, our rounding scheme
rounds a point $p$ to nearby points in $V$: $p$ will be randomly mapped into $V(p)$
in the natural manner.

Footnote: The function $g$ is Lipschitz continuous if $g$ is continuous and if there exists a finite
constant $\lambda$ such that $|g(a) - g(b)| \le \lambda |a - b|$.

Footnote: If this simplex is not unique, ie if $p$ lies on a face, then choose any adjacent simplex.
To formalize this, associate a test function $w_v(p)$ with each $v \in V$ as follows.
Each distribution $p$ can be uniquely written as a weighted average of its neighboring
vertices, $V(p)$. For $v \in V(p)$, let us define the test functions $w_v(p)$ to be
these linear weights, so they are uniquely defined by the linear equation:

$$ p = \sum_{v \in V(p)} w_v(p)\, v. $$

For $v \notin V(p)$, we define $w_v(p) = 0$. A useful property is that

$$ \sum_{v \in V(p)} w_v(p) = \sum_{v \in V} w_v(p) = 1 $$

which holds since $p$ is an average (under $w_v$) of the points in $V(p)$.

The functions $w_v$ imply a natural randomized rounding function. Define
the randomized rounding function $\mathrm{Round}_V$ as follows: for some distribution $p$,
$\mathrm{Round}_V(p)$ chooses $v \in V(p)$ with probability $w_v(p)$. We make the following
assumptions about a randomized rounding forecasting rule $F_V$ with respect to
$F$ and triangulation $V$:

1. $F$ is weakly calibrated.

2. If $F$ makes the forecast $f_t$ at time $t$, then $F_V$ makes the random forecast
$\mathrm{Round}_V(f_t)$ at this time.

3. The ($\ell_1$) diameter of any simplex in the triangulation is less than $\epsilon$, ie for
any $p$ and $q$ in the same simplex, $|p - q| \le \epsilon$.

An immediate corollary to the previous theorem is that $F_V$ is $\epsilon$-calibrated with
respect to the indicator test functions.

Corollary 1. For all $X$, the calibration error of $F_V$ is asymptotically less than
$\epsilon$, ie the probability (taken with respect to the randomization used by $\mathrm{Round}_V$)
that

$$ |\mu_T(I_{p,\epsilon}, X, F_V)| \le \epsilon $$

tends to 1 as $T$ tends to infinity.

To see this, note that the instantaneous error at time $t$, $X_t - \mathrm{Round}_V(f_t)$, has
an expected value of $\sum_v w_v(f_t)(X_t - v)$ which is $\epsilon$-close to $\sum_v w_v(f_t)(X_t - f_t)$.
The sum of this latter quantity converges to 0 by the previous theorem. The
(martingale) strong law of large numbers then suffices to prove this corollary.

This randomized scheme is “almost deterministic” in the sense that at each
time $t$ the forecast made by $F_V$ is $\epsilon$-close to a deterministic forecast. Interestingly,
this shows that an adversarial nature cannot foil the forecaster, even if nature
almost knows the forecast that will be used every round.
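
The rounding scheme of this subsection can be sketched for the simplest case. The block below is an illustration under stated assumptions, not the construction for a general triangulation: it assumes binary outcomes, so a forecast is a single number in [0, 1], and it takes the vertex set to be a uniform grid with spacing 0.01 (forecasts specified up to the second digit). The name `round_to_grid` is hypothetical.

```python
import numpy as np

def round_to_grid(p, grid, rng):
    """Randomly round p to one of its two neighbouring grid points so that E[result] = p."""
    # Index of the grid interval [grid[j], grid[j+1]] that contains p.
    j = int(np.clip(np.searchsorted(grid, p, side="right") - 1, 0, len(grid) - 2))
    lo, hi = grid[j], grid[j + 1]
    w_hi = (p - lo) / (hi - lo)          # linear weight w_v(p) of the upper neighbour
    return hi if rng.random() < w_hi else lo

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 101)        # assumed vertex set: two digits of precision
samples = np.array([round_to_grid(0.537, grid, rng) for _ in range(20000)])
```

Every rounded forecast is one of the two neighbours 0.53 and 0.54, so it stays within the grid spacing of the deterministic forecast 0.537, while the average of the rounded forecasts stays close to 0.537 because the weights are the linear interpolation weights.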
3 Publicly Calibrated Learning

First, some definitions are in order. Consider a game with $n$ players. Each player
$i$ has a finite action space $A_i$. The joint action space is then $A = \prod_{i=1}^{n} A_i$.
Associated with each player is a payoff function $u_i : A \to [0, 1]$. The interpretation
is that if the joint action $a \in A$ is taken by all players then player $i$ will receive
payoff $u_i(a)$.

If $p$ is a joint distribution over $A_{-i} = \prod_{j \ne i} A_j$, then we define $BR_i(p)$ to be
the set of all actions which are best responses for player $i$ to $p$, ie it is the set
of all $a \in A_i$ which maximize the function $E_{a_{-i} \sim p}[u_i(a, a_{-i})]$. It is also useful
to define $\epsilon$-$BR_i(p)$ as the set of all actions which are $\epsilon$-best responses to $p$, ie if
$a \in \epsilon$-$BR_i(p)$ then the utility $E_{a_{-i} \sim p}[u_i(a, a_{-i})]$ is $\epsilon$-close to the maximal utility
$\max_{a' \in A_i} E_{a_{-i} \sim p}[u_i(a', a_{-i})]$.

Given some distribution $f$ over $A$, it is convenient to denote the marginal
distribution of $f$ over $A_{-i}$ as $f_{-i}$. We say a distribution $f$ is a Nash equilibrium
(or, respectively, $\epsilon$-Nash equilibrium) if the following two conditions hold:

1. $f$ is a product distribution.

2. If action $a \in A_i$ has positive probability under $f$ then $a$ is in $BR_i(f_{-i})$ (or,
respectively, in $\epsilon$-$BR_i(f_{-i})$).

We denote the set of all Nash equilibria (or $\epsilon$-Nash equilibria) by $NE$ (or $NE_\epsilon$).
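
The best-response sets defined above are easy to compute for small games. The following is a minimal sketch, assuming player $i$'s payoffs are stored as a matrix indexed by the player's own action and by the opponents' joint action; the function name `eps_best_responses` and this storage layout are assumptions made for illustration.

```python
import numpy as np

def eps_best_responses(u_i, p, eps=0.0):
    """Actions of player i whose expected payoff under p is within eps of the maximum.

    u_i[a, b] is player i's payoff for own action a and opponents' joint action b;
    p[b] is the forecast probability of b.  eps = 0 gives the exact set BR_i(p).
    """
    expected = np.asarray(u_i, dtype=float) @ np.asarray(p, dtype=float)
    # A tiny tolerance guards against floating-point ties being missed.
    keep = np.flatnonzero(expected >= expected.max() - eps - 1e-12)
    return {int(a) for a in keep}

# A two-action coordination-style payoff for player i (payoffs in [0, 1]).
u = [[1.0, 0.0],
     [0.0, 1.0]]
exact = eps_best_responses(u, [0.7, 0.3])            # expected payoffs are (0.7, 0.3)
loose = eps_best_responses(u, [0.7, 0.3], eps=0.5)   # both actions are 0.5-best responses
```

With the forecast (0.7, 0.3) only the first action is an exact best response, while both actions are 0.5-best responses because their expected payoffs differ by 0.4.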



3.1 Using Public Forecasts 

A standard setting for learning in games is for each player $i$ to make some forecast
$p$ over $A_{-i}$ at time $t$. The action taken by player $i$ during this time would then
be some action that is a best response to $p$.

Now consider the setting in which all players observe some forecast $f_t$ over all
$n$ players, ie the forecast $f_t$ is a full joint probability distribution over $\Omega = A$.
Each player is only interested in the prediction of other players, so player $i$
can just use the marginal distribution $(f_t)_{-i}$ to form a prediction for the other
players. In order to calibrate, some randomized rounding is in order.

More formally, we define the public learning process with respect to a forecasting
rule $F$ and vertex set $V$ as follows: At each time $t$, $F$ provides a prediction
$f_t$ and each player $i$:

1. makes a prediction $p = \mathrm{Round}_V(f_t)$

2. chooses a best response to $p_{-i}$

We make the following assumptions.

1. $F$ is weakly calibrated.

2. Ties for a best response are broken with a deterministic, stationary rule.

3. If $p$ and $q$ are in the same simplex (of the triangulation) then $|p - q| \le \epsilon$.
It is straightforward to see that the forecasting rule of player $i$, which is
$(\mathrm{Round}_V(f_t))_{-i}$, is calibrated regardless of how the other players behave. By
the previous corollary the randomized scheme $\mathrm{Round}_V(f_t)$ will be $\epsilon$-calibrated.
Player $i$ can then simply ignore the direction $i$ of this forecast (by marginalizing)
and hence has an $\epsilon$-calibrated forecast over the reduced space $A_{-i}$.

Thus, the rather minimal accuracy condition that players make calibrated 
predictions is satisfied, and, in this sense, it is rational for players to use the 
forecasts made by F. In fact, the setting of “publicly announced” forecasts is 
only one way to view the scheme. Alternatively, one could assume that each 
player has knowledge of the deterministic rule F and makes the computations of 
ft themselves. Furthermore, F only needs the history of play as an input (and 
does not need any knowledge of the players’ utility functions). 

It is useful to make the following definitions. Let $\mathrm{Convex}(Q)$ be the set of
all convex combinations of distributions in $Q$. Define the distance between a
distribution $p$ and a set $Q$ as:

$$ d(p, Q) = \inf_{q \in Q} |p - q| $$

Using the result of Foster and Vohra [1997], it is immediate that the frequency
of empirical play in the public learning process will (almost surely) converge
into the set of $2\epsilon$-correlated equilibria, since the players are making $\epsilon$-calibrated
predictions, ie

$$ d\left( \frac{1}{T} \sum_{t=1}^{T} X_t, \; CE_{2\epsilon} \right) \to 0 $$

where $CE_{2\epsilon}$ is the set of $2\epsilon$-correlated equilibria. Our second main result shows we
can further restrict the convergent set to convex combinations of Nash equilibria,
a potentially much smaller set than the set of correlated equilibria.

Theorem 2. (Nash Convergence) The joint frequency of empirical play in the
public learning process converges into the set of convex combinations of $2\epsilon$-Nash
equilibria, ie with probability 1

$$ d\left( \frac{1}{T} \sum_{t=1}^{T} X_t, \; \mathrm{Convex}(NE_{2\epsilon}) \right) \to 0 $$

as $T$ goes to infinity. Furthermore, the rule $F$ rarely uses forecasts that are not
close to a $2\epsilon$-Nash equilibrium; by this, we mean that with probability one

$$ \frac{1}{T} \sum_{t=1}^{T} d(f_t, NE_{2\epsilon}) \to 0 $$

as $T$ goes to infinity.

Footnote: If $q_1, q_2, \ldots, q_m \in Q$ then $\alpha_1 q_1 + \alpha_2 q_2 + \ldots + \alpha_m q_m \in \mathrm{Convex}(Q)$, where the $\alpha_i$ are positive
and sum to one.
Since our convergence is with respect to the joint empirical play, an immediate
corollary is that the average payoff achieved by each player is at least
the player’s payoff under some $2\epsilon$-Nash equilibrium. Also, we have the following
corollary showing convergence to NE.

Corollary 2. If $F$ is uniformly, weakly calibrated and if the triangulation $V$ is
made finer (ie if $\epsilon$ is decreased) sufficiently slowly, then the joint frequency of
empirical play converges into the set of convex combinations of NE.

As we stated in the Introduction, we argue that the above result deals with 
the coordination problem of “which Nash equilibrium to play?” in a sensible 
manner. Though the players cannot be pinned down to play any particular Nash 
equilibrium, they do jointly play some Nash equilibrium for long subsequences. 
Furthermore, it is public knowledge which equilibrium is being played since
the predictions $f_t$ are frequently close to some Nash equilibrium (not general
combinations of them).

Now of course if the Nash equilibrium is unique, then the empirical 
play converges to it. This does not contradict the (impossibility) result of 
Hart and Mas-Colell [2003] — crudely, our learning setting keeps track of richer 
statistics from the history of play (which is not permitted in their setting). 



3.2 The Proof 

On some round in which $f$ is forecasted, every player acts according to a fixed
randomized rule. Let $\pi(f)$ be this “play distribution” over joint actions $A$ on any
round with forecast $f$. More precisely, if $f_t$ is the forecast at time $t$, then $\pi(f_t)$
is the expected value of $X_t$ given $f_t$. Clearly, $\pi(f)$ is a product distribution since
all players choose actions independently (since their randomization is private).

Lemma 1. For all Lipschitz continuous test functions $w$, with probability 1, we
have

$$ \frac{1}{T} \sum_{t=1}^{T} w(f_t)\,(f_t - \pi(f_t)) \to 0 $$

as $T$ tends to infinity.

Proof. Consider the stochastic process $Y_T = \frac{1}{T} \sum_{t=1}^{T} w(f_t)(X_t - \pi(f_t))$. This
is a martingale average (ie $T Y_T$ is a martingale), since at every round, the
expected value of $X_t$ is $\pi(f_t)$. By the martingale strong law we have $Y_T \to 0$
as $T$ tends to infinity, with probability one. Also, by calibration, we have
$\frac{1}{T} \sum_{t=1}^{T} w(f_t)(X_t - f_t) \to 0$ as $T$ tends to infinity. Subtracting the second
limit from the first leads to the result. □

We now show that fixed points of $\pi$ are approximate Nash equilibria.



Lemma 2. If $f = \pi(f)$, then $f$ is a $2\epsilon$-Nash equilibrium.
Proof. Assume that $a \in A_i$ has positive probability under $\pi(f)$. By definition of
the public learning process, action $a$ must be a best response to some distribution
$p_{-i}$, where $p \in V(f)$. Assumption 3 implies that $|p - f| \le \epsilon$, so it follows
that $|p_{-i} - f_{-i}| \le \epsilon$. Since the utility of taking $a$ under any distribution $q_{-i}$ is
$\sum_{a_{-i} \in A_{-i}} q_{-i}[a_{-i}]\, u_i(a, a_{-i})$, the previous inequality and boundedness of $u_i$ by
1 imply that $a$ must be a $2\epsilon$-best response to $f_{-i}$. Furthermore, $f$ is a product
distribution, since $\pi(f)$ is one. The result follows. □
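
The step from $|p_{-i} - f_{-i}| \le \epsilon$ to the $2\epsilon$-best-response claim can be spelled out as follows. The shorthand $U(a; q_{-i})$ for the expected utility is introduced here for illustration only and is not notation from the text.

```latex
% Shorthand (assumed for this sketch): U(a; q_{-i}) = \sum_{a_{-i}} q_{-i}[a_{-i}] \, u_i(a, a_{-i}).
% Since 0 \le u_i \le 1, for every action a' of player i:
\[
  |U(a'; p_{-i}) - U(a'; f_{-i})|
  \le \sum_{a_{-i}} |p_{-i}[a_{-i}] - f_{-i}[a_{-i}]| \, u_i(a', a_{-i})
  \le |p_{-i} - f_{-i}| \le \epsilon .
\]
% a is a best response to p_{-i}, so U(a; p_{-i}) \ge U(a'; p_{-i}) for all a'. Hence
\[
  U(a; f_{-i}) \ge U(a; p_{-i}) - \epsilon
  \ge U(a'; p_{-i}) - \epsilon
  \ge U(a'; f_{-i}) - 2\epsilon ,
\]
% which is exactly the statement that a is a 2\epsilon-best response to f_{-i}.
```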

Taken together, these last two lemmas suggest that forecasts which are used
often must be a $2\epsilon$-Nash equilibrium: the first lemma suggests that forecasts
$f$ which are used often must be equal to $\pi(f)$, and the second lemma states that
if this occurs, then $f$ is a $2\epsilon$-Nash equilibrium. We now make this precise.

Define a forecast $f$ to be asymptotically unused if there exists a continuous
test function $w$ such that $w(f) = 1$ and $\frac{1}{T} \sum_{t=1}^{T} w(f_t) \to 0$. In other words, a
forecast is asymptotically unused if we can find some small neighborhood around
it such that the limiting frequency of using a forecast in this neighborhood is 0.

Lemma 3. If $f$ is not a $2\epsilon$-Nash equilibrium, then it is asymptotically unused,
with probability one.

Proof. Consider a sequence of ever finer balls around $f$, and associate a continuous
test function with each ball that is nonzero within the ball. Let $r_1, r_2,
r_3, \ldots$ be a sequence of decreasing radii such that $r_i \to 0$ as $i$ tends to infinity.
Define the open ball $B_i$ as the set of all points $p$ such that $|p - f| < r_i$. Associate
a continuous test function $w_i$ with the $i$-th ball such that: if $p \notin B_i$, $w_i(p) = 0$
and if $p \in B_i$, $w_i(p) > 0$, with $w_i(f) = 1$. Clearly, this construction is possible.

Define the radius $r'_i$ as the maximal variation of $\pi$ within the $i$-th ball,
ie $r'_i = \sup_{p, q \in B_i} |\pi(p) - \pi(q)|$. Since $\pi(p)$ is continuous, then $r'_i \to 0$ as $i$ tends
to infinity.

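Using the fact that $|f - \pi(f)|$ is a constant (for the following first equality),

$$
\begin{aligned}
|f - \pi(f)| \, \frac{1}{T} \sum_{t=1}^{T} w_i(f_t)
 &= \left| \frac{1}{T} \sum_{t=1}^{T} w_i(f_t)\,(f - \pi(f)) \right| \\
 &= \left| \frac{1}{T} \sum_{t=1}^{T} w_i(f_t) \Big( (f - f_t) - (\pi(f) - \pi(f_t)) + (f_t - \pi(f_t)) \Big) \right| \\
 &\le \left| \frac{1}{T} \sum_{t=1}^{T} w_i(f_t)(f - f_t) \right|
    + \left| \frac{1}{T} \sum_{t=1}^{T} w_i(f_t)(\pi(f) - \pi(f_t)) \right|
    + \left| \frac{1}{T} \sum_{t=1}^{T} w_i(f_t)(f_t - \pi(f_t)) \right| \\
 &\le (r_i + r'_i) \, \frac{1}{T} \sum_{t=1}^{T} w_i(f_t)
    + \left| \frac{1}{T} \sum_{t=1}^{T} w_i(f_t)(f_t - \pi(f_t)) \right|
\end{aligned}
$$

where the last step uses the fact that $w_i(f_t)$ is zero if $|f_t - f| \ge r_i$ (ie if $f_t \notin B_i$)
along with the definitions of $r_i$ and $r'_i$.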

Now to prove that $f$ is asymptotically unused it suffices to show that there
exists some $i$ such that $\frac{1}{T} \sum_{t=1}^{T} w_i(f_t) \to 0$ as $T$ tends to infinity. For a proof by
contradiction, assume that such an $i$ does not exist. Dividing the above equation
by these sum weights, which are (asymptotically) nonzero by this assumption,
we have

$$ |f - \pi(f)| \le r_i + r'_i + \frac{\left| \frac{1}{T} \sum_{t=1}^{T} w_i(f_t)(f_t - \pi(f_t)) \right|}{\frac{1}{T} \sum_{t=1}^{T} w_i(f_t)} $$

Now by lemma 1, we know the numerator of the last term goes to 0. So, for all
$i$, we have that $|f - \pi(f)| \le r_i + r'_i$. By taking the limit as $i$ tends to infinity,
we have $|f - \pi(f)| = 0$. Thus $f$ is a $2\epsilon$-Nash equilibrium by the previous lemma,
which contradicts our assumption on $f$. □



We say a set of forecasts $Q$ is asymptotically unused if there exists a continuous
test function $w$ such that $w(f) = 1$ for all $f \in Q$ and $\frac{1}{T} \sum_{t=1}^{T} w(f_t) \to 0$.

Lemma 4. If $Q$ is a compact set of forecasts such that every $f \in Q$ is not a
$2\epsilon$-Nash equilibrium, then $Q$ is asymptotically unused, with probability one.



Proof. By the last lemma, we know that each $q \in Q$ is asymptotically unused.
Let $w_q$ be a test function which proves that $q$ is asymptotically unused. Since
$w_q$ is continuous and $w_q(q) = 1$, there exists an open neighborhood around $q$ in
which $w_q$ is strictly positive. Let $N(q)$ be this open neighborhood.

Clearly the set $Q$ is covered by the (uncountable) union of all open neighborhoods
$N(q)$, ie $Q \subset \cup_{q \in Q} N(q)$. Since $Q$ is compact, every cover of $Q$ by open
sets has a finite subcover. In particular, there exists a finite sized set $C \subset Q$
such that $Q \subset \cup_{c \in C} N(c)$.

Let us define the test function $w = \frac{1}{|C|} \sum_{c \in C} w_c$. We use this function to
prove that $Q$ is asymptotically unused (we modify it later to have value 1 on
$Q$). This function is continuous, since each $w_c$ is continuous. Also, $w$ is non-zero
for all $q \in Q$. To see this, for every $q \in Q$ there exists some $c \in C$ such that
$q \in N(c)$ since the sets $N(c)$ cover $Q$, and this implies that $w_c(q) > 0$. Furthermore, for
every $c \in C$, $\frac{1}{T} \sum_{t=1}^{T} w_c(f_t) \to 0$ with probability one and since $|C|$ is finite, we
have that $\frac{1}{T} \sum_{t=1}^{T} w(f_t) \to 0$ with probability one.

Since $Q$ is compact, $w$ takes on its minimum value on $Q$. Let $\alpha =
\min_{q \in Q} w(q)$, so $\alpha > 0$ since $w$ is positive on $Q$. Hence, the function $w(q)/\alpha$
is at least 1 on $Q$. Now the function $w'(q) = \min\{w(q)/\alpha, 1\}$ is continuous, one
on $Q$, and with probability one, $\frac{1}{T} \sum_{t=1}^{T} w'(f_t) \to 0$. Therefore, $w'$ proves that $Q$
is asymptotically unused. □



It is now straightforward to prove theorem 2. We start by proving that
$\frac{1}{T} \sum_{t=1}^{T} d(f_t, NE_{2\epsilon}) \to 0$ with probability one. It suffices to prove that with
probability one, for all $\delta > 0$ we have that

$$ \frac{1}{T} \sum_{t=1}^{T} I[d(NE_{2\epsilon}, f_t) \ge \delta] \to 0 $$

where $I$ is the indicator function. Let $Q_\delta$ be the set of $q$ such that $d(q, NE_{2\epsilon}) \ge \delta$.
This set is compact, so each $Q_\delta$ is asymptotically unused. Let $w_\delta$ be the function
which proves this. Since $w_\delta(f_t) \ge I[d(NE_{2\epsilon}, f_t) \ge \delta]$ (with equality on $Q_\delta$), the
above claim follows since $\frac{1}{T} \sum_{t=1}^{T} w_\delta(f_t) \to 0$.



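Now let us prove that $d\left( \frac{1}{T} \sum_{t=1}^{T} X_t, \mathrm{Convex}(NE_{2\epsilon}) \right) \to 0$ with probability
one. First, note that calibration implies $\frac{1}{T} \sum_{t=1}^{T} X_t - \frac{1}{T} \sum_{t=1}^{T} f_t \to 0$ (just take $w$
to be the constant test function to see this). Now the above statement directly
implies that $\frac{1}{T} \sum_{t=1}^{T} f_t$ must converge into the set $\mathrm{Convex}(NE_{2\epsilon})$.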



4 A Deterministically Calibrated Algorithm 



We now provide an algorithm that is uniformly, weakly calibrated for a constructive
proof of theorem 1. For technical reasons, it is simpler to allow our
algorithm to make forecasts which are not valid probability distributions: the
forecasts lie in the expanded set $\tilde{\Delta}$, defined as:

$$ \tilde{\Delta} = \Big\{ f : \sum_{k \in \Omega} f[k] = 1 \text{ and } f[k] \ge -\epsilon \Big\} $$

so clearly $\Delta \subset \tilde{\Delta}$, where $\Delta$ is the probability simplex in $\mathbb{R}^{|\Omega|}$. We later show that
we can run this algorithm and simply project its forecasts back onto $\Delta$ (which
does not alter our convergence results).

Similar to Subsection 2.2, consider a triangulation over this larger set $\tilde{\Delta}$ with
vertex set $V$, and let $V(p)$ be the corners of the simplex which contains $p$. It is useful
to make the following assumptions:

1. If $p, q$ are in the same simplex in the triangulation, $|p - q| \le \epsilon$.

2. Associated with each $v \in V$ we have a test function $w_v$ which satisfies:

a) If $v \notin V(p)$, then $w_v(p) = 0$.

b) For all $p \in \tilde{\Delta}$, $\sum_v w_v(p) = 1$ and $\sum_v w_v(p)\, v = p$.

3. For convenience, assume $\epsilon$ is small enough ($\epsilon \le \frac{1}{4|\Omega|}$ suffices) such that for
all $p, q \in \tilde{\Delta}$, we have $|p - q| \le 3$ (whereas for all $p, q \in \Delta$, $|p - q| \le 2$).

In the first subsection, we present an algorithm, Forecast the Fixed Point,
which (uniformly) drives the calibration error to 0 for those functions $w_v$. As
advertised, the algorithm simply forecasts a fixed point of a particular function.
It turns out that these fixed points can be computed efficiently (by tracking how
the function changes at each timestep), but we do not discuss this here. The next
subsection provides the analysis of this algorithm, which uses an “approachability”
argument along with properties of the fixed point. Finally, we take $\epsilon \to 0$
which drives the calibration error to 0 (at a bounded rate) for any Lipschitz
continuous test function, thus proving uniform, weak calibration.
4.1 The Algorithm: Forecast the Fixed Point

For notational convenience, we use $\mu_T(v)$ instead of $\mu_T(w_v, X, F)$, ie

$$ \mu_T(v) = \frac{1}{T} \sum_{t=1}^{T} w_v(f_t)\,(X_t - f_t) $$

For $v \in V$, define a function $\rho_T(v)$ which moves $v$ along the direction of calibration
error $\mu_T(v)$, ie

$$ \rho_T(v) = v + \mu_T(v) $$

For an arbitrary point $p \in \tilde{\Delta}$, define $\rho_T(p)$ by interpolating on $V$. Since $p =
\sum_{v \in V} w_v(p)\, v$, define $\rho_T(p)$ as:

$$ \rho_T(p) = \sum_{v \in V} w_v(p)\, \rho_T(v) = p + \sum_{v \in V} w_v(p)\, \mu_T(v) $$

Clearly, this definition is consistent with the above when $p \in V$. In the following
section, we show that $\rho_T$ maps $\tilde{\Delta}$ into $\tilde{\Delta}$, which allows us to prove that $\rho_T$ has
a fixed point in $\tilde{\Delta}$ (using Brouwer’s fixed point theorem).

The algorithm, Forecast the Fixed Point, chooses a forecast $f_T \in \tilde{\Delta}$ at time $T$
which is any fixed point of the function $\rho_{T-1}$, ie:

1. At time $T = 1$, set $\mu_0(v) = 0$ for all $v \in V$.

2. At time $T$, compute a fixed point of $\rho_{T-1}$.

3. Forecast this fixed point.
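
The three steps above can be sketched in the simplest setting. The block below is an illustration under stated assumptions rather than the general algorithm: outcomes are binary, so a forecast is one number (the probability of outcome 1) in the expanded interval $[-\epsilon, 1 + \epsilon]$; the vertex set is a uniform grid with spacing $\epsilon$; and the test functions $w_v$ are the piecewise-linear interpolation weights. In this one-dimensional case a fixed point is a root of a piecewise-linear function and can be found by scanning the grid. The names `hat_weights`, `fixed_point` and `run`, and the adversary used in the loop, are hypothetical.

```python
import numpy as np

def hat_weights(p, grid):
    """Weights w_v(p): p written as a convex combination of its two neighbouring vertices."""
    w = np.zeros(len(grid))
    j = int(np.clip(np.searchsorted(grid, p, side="right") - 1, 0, len(grid) - 2))
    lam = (p - grid[j]) / (grid[j + 1] - grid[j])
    w[j], w[j + 1] = 1.0 - lam, lam
    return w

def fixed_point(r, grid):
    """A root of p -> sum_v w_v(p) * r[v], i.e. a fixed point of rho_{T-1}.

    r[v] is the unnormalised error sum T * mu_T(v); the map is piecewise linear
    and takes the value r[j] at grid[j].
    """
    for j in range(len(grid)):
        if r[j] == 0.0:                  # a vertex with zero error is itself a fixed point
            return float(grid[j])
    for j in range(len(grid) - 1):
        if r[j] > 0.0 > r[j + 1]:        # sign change: interpolate linearly to the root
            lam = r[j] / (r[j] - r[j + 1])
            return float(grid[j] + lam * (grid[j + 1] - grid[j]))
    raise RuntimeError("no fixed point found on the grid")

def run(T, eps=0.1):
    m = round(1 / eps)
    grid = eps * np.arange(-1, m + 2)    # vertices -eps, 0, eps, ..., 1, 1 + eps
    r = np.zeros(len(grid))              # step 1: all error sums start at zero
    forecasts = []
    for _ in range(T):
        f = fixed_point(r, grid)         # steps 2 and 3: compute and forecast a fixed point
        x = 1.0 if f < 0.5 else 0.0      # assumed adversary: always contradicts the forecast
        r += hat_weights(f, grid) * (x - f)
        forecasts.append(f)
    return grid, r / T, forecasts

grid, mu, forecasts = run(2000)
```

In this scalar setting $|X_t - f_t| \le 1 + \epsilon$, so the same argument as in the error bound of the next subsection gives $\sum_v \mu_T(v)^2 \le (1 + \epsilon)^2 / T$, which the sketch can be checked against. The adversary here is the standard one that defeats deterministic forecasters under the indicator test functions; the weighted errors still vanish because the test functions are continuous.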

4.2 The Analysis of This Algorithm 

First, let us prove that the forecast used by the algorithm exists.

Lemma 5. (Existence) For all $X$ and $T$, a fixed point of $\rho_T$ exists in $\tilde{\Delta}$. Furthermore,
the forecast $f_T$ at time $T$ satisfies:

$$ \sum_{v \in V} w_v(f_T)\, \mu_{T-1}(v) = 0 $$

Proof. We use Brouwer’s fixed point theorem to prove existence, which involves
proving that: 1) the mapping is into, ie $\rho_T : \tilde{\Delta} \to \tilde{\Delta}$ and 2) the mapping is
continuous. First, let us show that $\rho_T(v) \in \tilde{\Delta}$ for points $v \in V$. We know

$$
\begin{aligned}
\rho_T(v) &= v + \frac{1}{T} \sum_{t=1}^{T} w_v(f_t)\,(X_t - f_t) \\
 &= \left( 1 - \frac{1}{T} \sum_{t=1}^{T} w_v(f_t) \right) v + \frac{1}{T} \sum_{t=1}^{T} w_v(f_t)\,(X_t + v - f_t)
\end{aligned}
$$

It suffices to prove that $X_t + v - f_t$ is in $\tilde{\Delta}$ (when $w_v(f_t) > 0$), since then the above
would be in $\tilde{\Delta}$ (by the convexity of $\tilde{\Delta}$). Note that $w_v(f_t) = 0$ when $|v - f_t| > \epsilon$.
Now if $|v - f_t| \le \epsilon$, then $v - f_t$ perturbs each component of $X_t$ by at most $\epsilon$,
which implies that $X_t + v - f_t \in \tilde{\Delta}$ since $X_t \in \Delta$. For general points $p \in \tilde{\Delta}$, the
mapping $\rho_T(p)$ must also be in $\tilde{\Delta}$, since the mapping is an interpolation. The
mapping is also continuous since the $w_v$'s are continuous. Hence, a fixed point
exists. The last equation follows by setting $\rho_{T-1}(f_T) = f_T$. □

Now let us bound the summed $\ell_2$ error, where $\|x\| = \sqrt{x \cdot x}$.

Lemma 6. (Error Bound) For any $X$, we have

$$ \sum_{v \in V} \|\mu_T(v)\|^2 \le \frac{9}{T} $$

Proof. It is more convenient to work with the unnormalized quantity $r_T(v) =
T \mu_T(v) = \sum_{t=1}^{T} w_v(f_t)(X_t - f_t)$. Note that

$$ \|r_T(v)\|^2 = \|r_{T-1}(v)\|^2 + w_v(f_T)^2 \|X_T - f_T\|^2 + 2 w_v(f_T)\, r_{T-1}(v) \cdot (X_T - f_T) $$

Summing the last term over $V$, we have

$$ \sum_{v \in V} w_v(f_T)\, r_{T-1}(v) \cdot (X_T - f_T) = (T - 1)(X_T - f_T) \cdot \sum_{v \in V} w_v(f_T)\, \mu_{T-1}(v) = 0 $$

where we have used the fixed point condition of the previous lemma. Summing
the middle term over $V$ and using $\|X_T - f_T\| \le |X_T - f_T| \le 3$, we have:

$$ \sum_{v \in V} w_v(f_T)^2 \|X_T - f_T\|^2 \le 9 \sum_{v \in V} w_v(f_T)^2 \le 9 \sum_{v \in V} w_v(f_T) = 9 $$

Using these bounds along with some recursion, we have

$$ \sum_{v \in V} \|r_T(v)\|^2 \le \sum_{v \in V} \|r_{T-1}(v)\|^2 + 9 \le 9T $$

The result follows by normalizing (ie by dividing the above by $T^2$). □

4.3 Completing the Proof for Uniform, Weak Calibration 

Let $g$ be an arbitrary Lipschitz function with Lipschitz parameter $\lambda_g$, ie $|g(a) -
g(b)| \le \lambda_g |a - b|$. We can use $V$ to create an approximation of $g$ as follows

$$ \hat{g}(p) = \sum_{v \in V} g(v)\, w_v(p). $$
vev 
This is a good approximation in the sense that:

$$ |\hat{g}(p) - g(p)| \le \epsilon \lambda_g $$

which follows from the Lipschitz condition and the fact that $p = \sum_v w_v(p)\, v$.

Throughout this section we let $F$ be “Forecast the Fixed Point”. Using the
definition of $\mu_T(g, X, F)$ along with $|X_t - f_t| \le 3$, we have

$$ |\mu_T(g, X, F)| \le \left| \frac{1}{T} \sum_{t=1}^{T} \hat{g}(f_t)\,(X_t - f_t) \right| + 3 \epsilon \lambda_g = |\mu_T(\hat{g}, X, F)| + 3 \epsilon \lambda_g $$



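Continuing and using our shorthand notation of $\mu_T(v)$,

$$
\begin{aligned}
|\mu_T(\hat{g}, X, F)| &= \left| \frac{1}{T} \sum_{t=1}^{T} \sum_{v \in V} g(v)\, w_v(f_t)\,(X_t - f_t) \right| \\
 &= \left| \sum_{v \in V} g(v)\, \mu_T(w_v, X, F) \right| \\
 &\le \sum_{v \in V} |\mu_T(v)| \\
 &\le \sqrt{ |V| \sum_{v \in V} \|\mu_T(v)\|^2 }
\end{aligned}
$$

where the first inequality follows from the fact that $g(v) \le 1$, and the last from
the Cauchy-Schwarz inequality.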

Using these inequalities along with lemma 6, we have

$$ |\mu_T(g, X, F)| \le \sqrt{\frac{9 |V|}{T}} + 3 \epsilon \lambda_g $$

Thus, for any fixed $g$ we can pick $\epsilon$ small enough to kill off $\lambda_g$. This unfortunately
implies that $|V|$ is large (since the vertex set size grows with $1/\epsilon$). But
we can make $T$ large enough to kill off this $|V|$. To get convergence to precisely
zero, we follow the usual approach of slowly tightening the parameters. This will
be done in phases. Each phase will halve the value of the target accuracy and will
be long enough to cover the burn-in part of the following phase (where error
accrues).
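
As a worked illustration of how the two parameters trade off within a single phase, read the bound above as $\sqrt{9|V|/T} + 3\epsilon\lambda_g$ and fix a target accuracy. The symbol $\delta$ for the target accuracy is introduced here for illustration and is not notation from the text.

```latex
% Target: |\mu_T(g, X, F)| \le \delta for a fixed Lipschitz g.
% Split the budget evenly between the two terms of the bound.
\[
  3 \epsilon \lambda_g \le \frac{\delta}{2}
  \quad\Longleftrightarrow\quad
  \epsilon \le \frac{\delta}{6 \lambda_g},
  \qquad
  \sqrt{\frac{9 |V|}{T}} \le \frac{\delta}{2}
  \quad\Longleftrightarrow\quad
  T \ge \frac{36 |V|}{\delta^2}.
\]
% Here |V| is the size of the vertex set of the triangulation chosen for this \epsilon,
% so halving \delta forces a finer triangulation (larger |V|) and a longer phase.
```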

Our proof is essentially complete, except for the fact that the algorithm $F$
described so far could sometimes forecast outside the simplex (with probabilities
greater than 1 or less than zero). To avoid this, we can project a forecast in $\tilde{\Delta}$
onto the closest point in $\Delta$. Let $P(\cdot)$ be such a projection operator. For any
$f \in \tilde{\Delta}$, we have $|P(f) - f| \le |\Omega| \epsilon$. Thus, for any Lipschitz weighting function $w$
we have

$$
\begin{aligned}
\mu_T(w, X, P \circ F) &= \frac{1}{T} \sum_{t=1}^{T} w(P(f_t))\,(X_t - P(f_t)) \\
 &= \frac{1}{T} \sum_{t=1}^{T} w(P(f_t))\,(X_t - f_t) + \frac{1}{T} \sum_{t=1}^{T} w(P(f_t))\,(f_t - P(f_t)) \\
 &\le \mu_T(w \circ P, X, F) + |\Omega| \epsilon
\end{aligned}
$$

Hence the projected version also converges to 0 as $\epsilon \to 0$ (since $w \circ P$ is also
Lipschitz continuous). Theorem 1 follows.
Abstract. We consider Reinforcement Learning for average reward
zero-sum stochastic games. We present and analyze two algorithms. The 
first is based on relative Q-learning and the second on Q-learning for 
stochastic shortest path games. Convergence is proved using the ODE 
(Ordinary Differential Equation) method. We further discuss the case 
where not all the actions are played by the opponent with compara- 
ble frequencies and present an algorithm that converges to the optimal 
Q-function, given the observed play of the opponent. 



1 Introduction 

Since its publication in [DW92], the Q-learning algorithm has been implemented in many
applications and analyzed in several different setups (e.g., [BT95, ABB01,
BM00]). The Q-learning algorithm for learning an optimal policy in Markov
Decision Processes (MDPs) is a direct off-policy learning algorithm in which a
Q-value vector is learned for every state and action. For the discounted case, the
Q-value of a specific state-action pair represents the expected discounted utility
if the action is chosen in the specific state and an optimal policy is then followed.
In this work we deviate from the standard Q-learning scheme in two ways. First, 
we discuss games, rather than MDPs. Second, we consider the average reward 
criterion rather than discounted reward. 
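
For reference, the standard discounted, single-agent scheme that this work departs from can be sketched as a tabular update. This is a minimal sketch of textbook Q-learning, not one of the algorithms of this paper; the function name `q_learning_step`, the step size `alpha` and the discount factor `gamma` are assumptions made for illustration.

```python
import numpy as np

def q_learning_step(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update for a discounted MDP (modifies Q in place).

    Q[s, a] estimates the expected discounted utility of taking action a in
    state s and following an optimal policy afterwards.
    """
    target = reward + gamma * Q[s_next].max()   # greedy (off-policy) bootstrap target
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((2, 2))                            # two states, two actions
q_learning_step(Q, s=0, a=1, reward=1.0, s_next=1, alpha=0.5)
```

The update is off-policy because the target uses the maximizing action in the next state regardless of which action is actually played there.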

Reinforcement learning for average reward MDPs was suggested in [Sch93]
and further studied in [Sin94, Mah96]. Some analysis appeared later in [ABB01,
BT95]. The analysis for average reward is considerably more cumbersome than
that of discounted reward, since the dynamic programming operator is no longer
a contraction. There are several methods for average reward reinforcement learning,
including Q-learning ([ABB01]), a polynomial PAC model-based learning
model ([KS98]), actor critic ([KT03]), etc. Convergence proofs of Q-learning
based algorithms for average reward typically rely on the ODE method and 
the fact that the Q-learning algorithm is essentially an asynchronous stochastic 
approximation algorithm. 

Q-learning for zero-sum stochastic games (SGs) was suggested in [Lit94] for 
discounted reward. The convergence proof of this algorithm appears, in a broader 
context, in [LS99]. The main difficulty in applying Q-learning to games is that 
Q-learning is inherently an off-policy learning algorithm. This means that the
optimal policy is learned while another policy is played. Moreover, the opponent 
may refrain from playing certain actions (or play them only a few times) so 
the model parameters may never be fully revealed. Consequently, every learning 
algorithm is doomed to learn a potentially inferior policy. On-policy algorithms, 
whose performance is measured according to the reward they accumulate may, 
however, attain an average reward which is close to the value of the game (e.g., 
[BT02]). We note two major difficulties with Q-learning style algorithms. First, 
one needs all actions in all states to be chosen infinitely often by both players 
(actually comparatively often for average reward). Second, the standard analysis 
of Q-learning (e.g., [Tsi94,BT95]) relies on contraction properties of the dynamic 
programming operator which follow easily for discounted reward or shortest path 
problems, but do not hold for average reward. We start by addressing the second 
issue and present two Q-learning type algorithms for SGs. We show that if all 
actions in all states are played comparatively often then convergence to the true 
Q-value is guaranteed. We then tackle the problem of exploration and show 
that by slightly modifying the Q-learning algorithm we can make sure that the 
Q-vector converges to the Q-vector of the observed game.

The convergence analysis of the Q-learning algorithms is based on [BM00,
ABB02]. The main problem is the unfortunate fact that the dynamic programming
operator of interest is not a contraction operator. In Section 3 we present a
version of Relative Q-learning (e.g., [BS98]) adapted to average reward SGs. We
later modify the λ-SSP (Stochastic Shortest Path) formulation of [BT95, Section
7.1] to average reward SGs. The idea is to define a related SSPG (Stochastic
Shortest Path Game) and show that by solving the SSPG the original average
reward problem is solved as well.

The paper is organized as follows: In Section 2 we define the stochastic game 
(SG) model, and recall some results from the theory of stochastic games. The 
relative Q-learning algorithm for average reward games is presented in Section 3. 
The λ-SSPG algorithm is described in Section 4. Since the opponent may refrain 
from playing certain actions, the true Q-vector may be impossible to learn. We 
show how this can be corrected by considering the observed game. This is done 
in Section 5. Brief concluding remarks are drawn in Section 6. The convergence 
proofs of both algorithms are deferred to the appendix. 

2 Model and Preliminaries 

In this section we formally define SGs. We then state a stability assumption 
which is needed in order to guarantee that our analysis holds and that the value 
is independent of the initial state. We finally survey some known results from 
the theory of SGs. 

2.1 Model 

We consider an average reward zero-sum finite (states and actions) SG which is 
played ad infinitum. We refer to the players as P1 (the decision maker of interest) 
and P2 (the adversary). The game is defined by the five-tuple 
$(\mathcal{S}, \mathcal{A}, \mathcal{B}, P, r)$, where: 

1. $\mathcal{S}$ is the finite set of states of the stochastic game, 
$\mathcal{S} = \{1, \ldots, S\}$. 

2. $\mathcal{A}$ is the set of actions of P1 in each state, 
$\mathcal{A} = \{1, \ldots, A\}$. To streamline the notation it is assumed 
that in all states P1 has the same available actions. 

3. $\mathcal{B}$ is the set of actions of P2 in each state, 
$\mathcal{B} = \{1, \ldots, B\}$. It is assumed that in all states P2 has the 
same available actions. 

4. $P$ is the conditional transition law, 
$P : \mathcal{S} \times \mathcal{A} \times \mathcal{B} \times \mathcal{S} \to [0,1]$, 
such that $P(s'|s,a,b)$ is the probability that the next state is $s'$ given 
that the current state is $s$, P1 plays $a$, and P2 plays $b$. 

5. $r$ is P1's (random) reward function, 
$r : \mathcal{S} \times \mathcal{A} \times \mathcal{B} \to \mathbb{R}$. The 
reward obtained when P1 plays $a$, P2 plays $b$, and the current state is $s$ is 
distributed according to a measure $\mu(s,a,b)$ whose mean is $r(s,a,b)$. A 
bounded second moment is assumed. 
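The five-tuple above maps directly onto a small data structure. The following is a minimal sketch, not part of the original paper: the class name `StochasticGame`, the uniform reward noise, and the toy numbers are all illustrative assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class StochasticGame:
    """Finite zero-sum SG (illustrative container, names are assumptions).

    P[s][a][b] is a list of next-state probabilities P(.|s,a,b);
    r[s][a][b] is the mean reward r(s,a,b) paid to P1.
    """
    P: list
    r: list

    @property
    def num_states(self):
        return len(self.P)

    def step(self, s, a, b, rng, noise=0.1):
        """Sample (reward, next_state) for state s, P1 action a, P2 action b."""
        # Mean r(s,a,b) plus bounded zero-mean noise, so the second moment is bounded.
        reward = self.r[s][a][b] + rng.uniform(-noise, noise)
        next_state = rng.choices(range(self.num_states), weights=self.P[s][a][b])[0]
        return reward, next_state


# Toy game: two states, two actions per player. State 0 is reachable from
# every state under every action pair, so it can play the role of s*.
game = StochasticGame(
    P=[[[[0.5, 0.5], [0.9, 0.1]], [[0.2, 0.8], [0.5, 0.5]]],
       [[[1.0, 0.0], [0.7, 0.3]], [[0.6, 0.4], [1.0, 0.0]]]],
    r=[[[1.0, -1.0], [-1.0, 1.0]],
       [[0.0, 2.0], [2.0, 0.0]]],
)
rng = random.Random(0)
reward, next_state = game.step(0, 1, 0, rng)
```

Nested lists keep the sketch dependency-free; an array library would be the more natural choice for larger games.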

At each time epoch $n$, both players observe the current state $s_n$, and then P1 
and P2 choose actions $a_n$ and $b_n$, respectively. As a result P1 receives a 
reward of $r_n$ which is distributed according to $\mu(s_n,a_n,b_n)$. The next 
state is determined according to the transition probability 
$P(\cdot|s_n,a_n,b_n)$. A policy $\sigma_1 \in \Sigma_1$ for P1 is a mapping 
from all possible histories (including states, actions, and rewards) to the set 
of mixed actions $\Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ is the set 
of all probability measures over $\mathcal{A}$. Similarly, a policy 
$\sigma_2 \in \Sigma_2$ for P2 is a mapping from all possible histories to the 
mixed actions $\Delta(\mathcal{B})$. A policy of either player is called 
stationary if the mixed action at time $n$ depends only on the state $s_n$. Let 
the average reward at time $n$ be denoted by 
$\hat{r}_n = \frac{1}{n}\sum_{\tau=1}^{n} r_\tau$. 



2.2 A Stability Assumption 

We shall make the following assumption throughout the paper. The assumption 
can be thought of as a stability or recurrence assumption. The state $s^*$ is a 
reference state to which a return is guaranteed. Recall that a state is recurrent 
under a certain pair of policies of P1 and P2 if that state is visited with 
probability 1 in finite time when the players follow their policies. 

Assumption 1 (Recurrent State). There exists a state $s^* \in \mathcal{S}$ which 
is recurrent for every pair of stationary strategies played by P1 and P2. 

We say that an SG has a value $v$ if 

$$v = \sup_{\sigma_1} \inf_{\sigma_2} \liminf_{n \to \infty} E_{\sigma_1,\sigma_2}[\hat{r}_n] = \inf_{\sigma_2} \sup_{\sigma_1} \limsup_{n \to \infty} E_{\sigma_1,\sigma_2}[\hat{r}_n] .$$



For finite games, the value exists ([MN81]). If Assumption 1 holds, then the value 
is independent of the initial state and can be achieved in stationary strategies 
(e.g., [FV96]). 

2.3 Average Reward Zero-Sum Stochastic Games Background 

We now recall some known results from the theory of average reward scalar 
games. We assume henceforth that Assumption 1 is satisfied. For such games it 
is known (e.g., [FV96, Theorem 5.3.3]) that there is a value and a bias vector, 
that is, there exist a number $v$ and a vector $H \in \mathbb{R}^{S}$ such that 
for each $s \in \mathcal{S}$: 

$$H(s) + v = \mathrm{val}_{a,b}\Big[ r(s,a,b) + \sum_{s' \in \mathcal{S}} P(s'|s,a,b) H(s') \Big], \qquad (2.1)$$

where $\mathrm{val}_{a,b}$ is the minimax operator, which is defined for a 
matrix $R$ with $A$ rows and $B$ columns as 
$\mathrm{val}_{a,b}[R] = \sup_{\nu \in \Delta(\mathcal{A})} \inf_{\mu \in \Delta(\mathcal{B})} \sum_{a=1}^{A} \sum_{b=1}^{B} \nu_a R_{ab} \mu_b$. 
Furthermore, in [Pat97, page 90] it was shown that under Assumption 1 there 
exists a unique $H$ such that Equation (2.1) holds for every $s \in \mathcal{S}$ 
and for some specific $s'$ we have that $H(s') = 0$. We note that when the game 
parameters are known there are efficient methods to compute $H$ and $v$; see 
[FV96,Pat97]. It is often convenient to use operator notation. In this case the 
resulting (vector) equation is: 

$$ve + H^* = TH^*, \qquad (2.2)$$

where $e \in \mathbb{R}^{S}$ is the ones vector ($e = (1, \ldots, 1)$) and 
$T : \mathbb{R}^{S} \to \mathbb{R}^{S}$ is the dynamic programming operator 
defined by: 

$$TH(s) = \mathrm{val}_{a,b}\Big[ r(s,a,b) + \sum_{s' \in \mathcal{S}} P(s'|s,a,b) H(s') \Big]. \qquad (2.3)$$

It turns out that $T$ is not a contraction, so that Q-learning style mechanisms 
that rely on contraction properties may not converge. Thus, a refined scheme 
should be developed. Note that if $H^*$ is a solution of (2.2) so is $H^* + ce$, 
so that one must take into account the non-uniqueness of the solutions of (2.2). 
We propose two different schemes to overcome this non-uniqueness. The first 
scheme is based on the uniqueness of the solution of Equation (2.2) that 
satisfies $H(s^*) = 0$, and the second is based on a contraction property of a 
related dynamic programming operator (for an associated stochastic shortest path 
game). 

Our goal is to find the optimal Q-vector which satisfies 
$Q^*(s,a,b) = r(s,a,b) + \sum_{s'} P(s'|s,a,b) H^*(s')$, where $H^*$ is a 
solution of the optimality equation (2.2). Note that if $H^*$ is determined 
uniquely (by requiring $H^*(s^*) = 0$) then $Q^*$ is also unique. The Q-vector 
is defined on $\mathbb{R}^{S \cdot A \cdot B}$; the interpretation of 
$Q(s,a,b)$ is the relative gain for P1 to use action $a$ assuming P2 will use 
action $b$, when the current state is $s$. Given the vector $Q^*$, the maximin 
policy is to play at state $s$ a maximin (mixed) action with respect to the 
matrix game whose entries are $Q^*(s,\cdot,\cdot)$. 
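The minimax operator applied to a matrix game is the basic building block of everything that follows. One dependency-free way to bracket its value is fictitious play, sketched below. This is a swapped-in illustration technique, not the method of the paper (which relies on the exact value, typically obtained by linear programming); the function name `matrix_game_bounds` is hypothetical.

```python
def matrix_game_bounds(R, iterations=1000):
    """Bracket the value of matrix game R (row player maximizes) by fictitious play.

    Returns (lower, upper) with lower <= value <= upper. Both bounds come from
    empirical mixed strategies, so they are valid after any number of steps.
    """
    n_rows, n_cols = len(R), len(R[0])
    row_counts = [0] * n_rows   # how often P1 played each row
    col_counts = [0] * n_cols   # how often P2 played each column
    a, b = 0, 0                 # arbitrary initial pure actions
    lower, upper = float("-inf"), float("inf")
    for t in range(1, iterations + 1):
        row_counts[a] += 1
        col_counts[b] += 1
        # t times the payoff of each row against P2's empirical mix.
        row_payoff = [sum(R[i][j] * col_counts[j] for j in range(n_cols))
                      for i in range(n_rows)]
        # t times the payoff of P1's empirical mix against each column.
        col_payoff = [sum(R[i][j] * row_counts[i] for i in range(n_rows))
                      for j in range(n_cols)]
        upper = min(upper, max(row_payoff) / t)  # P2's mix holds P1 to at most this
        lower = max(lower, min(col_payoff) / t)  # P1's mix guarantees at least this
        # Each player best-responds to the opponent's empirical mix.
        a = row_payoff.index(max(row_payoff))
        b = col_payoff.index(min(col_payoff))
    return lower, upper


saddle_bounds = matrix_game_bounds([[3, 1], [4, 2]])     # pure saddle point, value 2
pennies_bounds = matrix_game_bounds([[1, -1], [-1, 1]])  # matching pennies, value 0
```

The gap between the two bounds shrinks slowly for games with mixed equilibria, so an LP solver is preferable whenever one is available.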



3 Relative Q-learning 

Relative Q-learning for average reward MDPs was suggested by [Sch93], and 
studied later in [Sin94,Mah96]. It is the simulation counterpart of the relative 
value iteration algorithm (e.g., [Put94]) for solving average reward MDPs. The 
following algorithm is the SG (asynchronous) version of the relative Q-learning 
algorithm. 

$$Q_{n+1}(s,a,b) = Q_n(s,a,b) + 1_{\{s_n=s,\,a_n=a,\,b_n=b\}}\, \gamma(N(n,s,a,b)) \big( r_n + FQ_n(s_{n+1}) - f(Q_n) - Q_n(s,a,b) \big), \qquad (3.4)$$

where $N(n,s,a,b)$ denotes the number of times that state $s$ and actions $a$ 
and $b$ were played up to time $n$ (i.e., 
$N(n,s,a,b) = \sum_{\tau=1}^{n} 1_{\{s_\tau=s,\,a_\tau=a,\,b_\tau=b\}}$), and 
$F : \mathbb{R}^{S \cdot A \cdot B} \to \mathbb{R}^{S}$ is the per state value 
function which satisfies $FQ(s) = \mathrm{val}_{a,b}[Q(s,a,b)]$. The function 
$f : \mathbb{R}^{S \cdot A \cdot B} \to \mathbb{R}$ is required to have the 
following properties: 1. $f$ is Lipschitz; 2. $f$ is scaling invariant: 
$f(\alpha Q) = \alpha f(Q)$; 3. $f$ is translation invariant: 
$f(Q + er) = f(Q) + r$, where $e$ is the vector of ones (note the abuse of 
notation: $e$ is $S \cdot A \cdot B$ dimensional here). Examples of valid $f$'s 
are $f(Q) = Q(s^0,a^0,b^0)$ for some $(s^0,a^0,b^0)$, or 
$f(Q) = \frac{1}{SAB} \sum_{s,a,b} Q(s,a,b)$. 

Intuitively, $f$ takes care of having the Q-vector bounded. More precisely, we 
shall use $f$ in the proof to ensure that the underlying ODE has a unique solution. 
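The two example choices of the relative function can be checked numerically against the scaling and translation requirements. The sketch below is illustrative only; the helper names (`f_reference`, `f_mean`, `scale`, `shift`) and the test vector are assumptions.

```python
def f_reference(Q, ref=(0, 0, 0)):
    """f(Q) = Q(s0, a0, b0) for a fixed reference triplet."""
    s, a, b = ref
    return Q[s][a][b]


def f_mean(Q):
    """f(Q) = average of all entries of Q."""
    entries = [q for state in Q for row in state for q in row]
    return sum(entries) / len(entries)


def scale(Q, alpha):
    """Return alpha * Q as a new nested list."""
    return [[[alpha * q for q in row] for row in state] for state in Q]


def shift(Q, r):
    """Return Q + e*r (add r to every entry) as a new nested list."""
    return [[[q + r for q in row] for row in state] for state in Q]


Q = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, -2.0], [5.0, 3.0]]]
checks = []
for f in (f_reference, f_mean):
    checks.append(abs(f(scale(Q, 3.0)) - 3.0 * f(Q)))    # scaling invariance
    checks.append(abs(f(shift(Q, 0.5)) - (f(Q) + 0.5)))  # translation invariance
```

Both functions are also Lipschitz in the sup norm with constant 1, which covers the first requirement.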

We require the standard stochastic approximation assumption on the learning 
rate $\gamma$. Namely, $\gamma$ should be square summable but not summable, and 
"regular" in the sense that it does not vanish occasionally. More precisely: 

Assumption 2 (Learning Rate). The sequence $\gamma(n)$ satisfies:^1 

1. For every $0 < x < 1$, $\sup_k \gamma(\lfloor xk \rfloor)/\gamma(k) < \infty$. 

2. $\sum_{n=1}^{\infty} \gamma(n) = \infty$ and $\sum_{n=1}^{\infty} \gamma(n)^2 < \infty$. 

3. For every $0 < x < 1$ the limit 
$\big(\sum_{m=1}^{\lfloor yk \rfloor} \gamma(m)\big) / \big(\sum_{m=1}^{k} \gamma(m)\big) \to 1$ 
as $k \to \infty$, uniformly in $y \in [x, 1]$. 
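As a rough numerical illustration of the second condition, the partial sums of the harmonic step size keep growing while the partial sums of its squares level off. This is a sanity check under the assumed choice $\gamma(n) = 1/n$, not a proof; the helper name `partial_sums` is hypothetical.

```python
import math


def partial_sums(gamma, n_terms):
    """Return (sum of gamma(n), sum of gamma(n)^2) for n = 1..n_terms."""
    total = sum(gamma(n) for n in range(1, n_terms + 1))
    total_sq = sum(gamma(n) ** 2 for n in range(1, n_terms + 1))
    return total, total_sq


def gamma(n):
    """Assumed learning rate: the harmonic sequence 1/n."""
    return 1.0 / n


small = partial_sums(gamma, 1_000)
large = partial_sums(gamma, 100_000)
growth = large[0] - small[0]         # roughly log(100): the sum keeps growing
tail_sq = large[1] - small[1]        # tiny: the squared sum has almost converged
limit_sq = math.pi ** 2 / 6          # known limit of the sum of 1/n^2
```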

For example, $\gamma(n) = 1/n$ and $\gamma(n) = 1/(n \log n)$ ($n > 1$) satisfy 
this assumption. The following assumption is crucial in analyzing the 
asynchronous stochastic approximation algorithm. 

Assumption 3 (Often updates). There exists a deterministic $\delta > 0$ such that 
for every $s \in \mathcal{S}$, $a \in \mathcal{A}$, $b \in \mathcal{B}$, 
$\liminf_{n \to \infty} N(n,s,a,b)/n \ge \delta$ with probability 1. That 
is, all components are updated comparatively often. 

The following theorem is proved in Appendix A.1. 

Theorem 1. Suppose that Assumptions 1, 2 and 3 hold. Then the asynchronous 
algorithm (3.4) converges with probability 1 to $Q^*$. 
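A single asynchronous step of (3.4) can be sketched as follows. To stay self-contained the sketch is restricted to two actions per player, so that the per-state value can use the closed form for 2x2 matrix games; that restriction, the nested-list layout, and the names `val_2x2` and `relative_q_step` are assumptions made for illustration.

```python
def val_2x2(M):
    """Value of a 2x2 zero-sum matrix game (row player maximizes)."""
    (a, b), (c, d) = M
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:                      # pure saddle point
        return maximin
    return (a * d - b * c) / (a + d - b - c)    # fully mixed equilibrium


def relative_q_step(Q, N, s, a, b, reward, s_next, f, gamma):
    """Apply update (3.4) in place to the visited triplet (s, a, b)."""
    N[s][a][b] += 1                             # N(n,s,a,b) includes the current visit
    step = gamma(N[s][a][b])
    target = reward + val_2x2(Q[s_next]) - f(Q) # r_n + FQ_n(s_{n+1}) - f(Q_n)
    Q[s][a][b] += step * (target - Q[s][a][b])
    return Q[s][a][b]


# Two states, two actions each; f picks a fixed reference entry.
Q = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]
N = [[[0, 0], [0, 0]] for _ in range(2)]
f = lambda q: q[0][0][0]
gamma = lambda n: 1.0 / n
first = relative_q_step(Q, N, s=0, a=1, b=0, reward=1.0, s_next=1, f=f, gamma=gamma)
second = relative_q_step(Q, N, s=1, a=0, b=0, reward=0.5, s_next=0, f=f, gamma=gamma)
```

Only the visited entry changes in each step, which is why Assumption 3 on visit frequencies is needed for convergence.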

4 λ-SSPG Q-learning 

^1 $\lfloor x \rfloor$ is the integer part of $x$. 

A different approach is to use the λ-SSP (Stochastic Shortest Path) formulation, 
suggested by Bertsekas and Tsitsiklis [BT95, Section 7.1] for average reward 
MDPs and analyzed in [ABB01]. The key idea is to view the average reward 
as the ratio of the expected total reward between renewals and the expected 
time between renewals. We consider a similar approach for SGs, using results 
from [Pat97] regarding SSPGs. From the stochastic approximation point of view 
we maintain two time scales. We iterate the average reward estimate, λ, slowly 
towards the value of the game, while the Q-vector is iterated on a faster scale 
so that it tracks the Q-vector of the associated SSPG. The convergence follows 
from Borkar's two-time-scale stochastic approximation ([Bor97]). There are two 
equations that are iterated simultaneously: the first is related to the Q-vector, 
which is defined as a vector in $\mathbb{R}^{S \cdot A \cdot B}$, and the second 
is related to λ, which is a real number. The λ-SSPG Q-learning algorithm is: 

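$$Q_{n+1}(s,a,b) = Q_n(s,a,b) + 1_{\{s_n=s,\,a_n=a,\,b_n=b\}}\, \gamma(N(n,s,a,b)) \big( r_n + FQ_n(s_{n+1})\, 1_{\{s_{n+1} \ne s^*\}} - \lambda_n - Q_n(s,a,b) \big),$$

$$\lambda_{n+1} = \Lambda\big(\lambda_n + b(n)\, FQ_n(s^*)\big), \qquad (4.5)$$

where $b(n) = o(\gamma(n))$, $\Lambda$ is the projection onto the interval 
$[-K, K]$ with $K$ chosen such that $|v| < K$, and $N(n,s,a,b)$ and $F$ are as 
before. 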
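The renewal-ratio view behind this formulation can be checked on a small example. The sketch below assumes both players use fixed stationary policies, so the game collapses to a Markov reward chain; that simplification, the chain itself, and the name `renewal_ratio` are assumptions made for illustration.

```python
def renewal_ratio(P, r, ref=0, sweeps=500):
    """Average reward of a finite Markov reward chain, computed as
    (expected reward per cycle) / (expected cycle length), with cycles
    starting and ending at the reference state `ref`."""
    n = len(P)
    total = [0.0] * n   # expected reward collected until the next visit to ref
    time = [0.0] * n    # expected number of steps until the next visit to ref
    for _ in range(sweeps):
        # Paths are cut when they enter ref, so j == ref contributes nothing.
        total = [r[s] + sum(P[s][j] * total[j] for j in range(n) if j != ref)
                 for s in range(n)]
        time = [1.0 + sum(P[s][j] * time[j] for j in range(n) if j != ref)
                for s in range(n)]
    return total[ref] / time[ref]


# Two-state chain: stationary distribution is (2/3, 1/3), so the long-run
# average reward is 2/3 * 1 + 1/3 * 3 = 5/3.
P = [[0.5, 0.5], [1.0, 0.0]]
r = [1.0, 3.0]
rate = renewal_ratio(P, r)
```

Subtracting the average reward from every per-step reward makes the expected total reward over a cycle zero, which is the role λ plays in the associated shortest path problem.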

An additional assumption we require is that all the elements are sampled in 
an evenly distributed manner. More precisely: 

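Assumption 4. For every $x > 0$ let 
$M(n,x) = \min\{ m > n : \sum_{k=n+1}^{m} \gamma(k) \ge x \}$. Then 
for every $s, s' \in \mathcal{S}$, $a, a' \in \mathcal{A}$, 
$b, b' \in \mathcal{B}$ the limit 

$$\lim_{n \to \infty} \frac{\sum_{k=N(n,s,a,b)}^{N(M(n,x),s,a,b)} \gamma(k)}{\sum_{k=N(n,s',a',b')}^{N(M(n,x),s',a',b')} \gamma(k)}$$

exists almost surely. 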

The following theorem is proved in Appendix A.2. 

Theorem 2. Suppose that Assumptions 1, 2, 3, and 4 hold. Further, assume 
that $b(n)$ satisfies Assumption 2 and that $b(n) = o(\gamma(n))$. Then the 
asynchronous algorithm (4.5) converges with probability 1 so that 
$Q_n \to Q^*$ and $\lambda_n \to v$. 
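One asynchronous step of (4.5) can be sketched as below. As before, the sketch assumes two actions per player so that a closed-form 2x2 game value can be used; the function names, the slow step size $b(n) = 1/(n \log(n+2))$, and the numbers are illustrative assumptions rather than choices made in the paper.

```python
import math


def val_2x2(M):
    """Value of a 2x2 zero-sum matrix game (row player maximizes)."""
    (a, b), (c, d) = M
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:                      # pure saddle point
        return maximin
    return (a * d - b * c) / (a + d - b - c)    # fully mixed equilibrium


def lambda_sspg_step(Q, lam, N, n, s, a, b, reward, s_next, s_ref, gamma, slow, K):
    """One step of (4.5): update Q[s][a][b] in place, return the new lambda."""
    N[s][a][b] += 1
    # The continuation value is dropped when the next state is the reference state.
    cont = 0.0 if s_next == s_ref else val_2x2(Q[s_next])
    # Slow time scale: uses Q_n before it is overwritten, projected onto [-K, K].
    new_lam = max(-K, min(K, lam + slow(n) * val_2x2(Q[s_ref])))
    # Fast time scale: Q tracks the shortest path game with rewards reduced by lambda.
    Q[s][a][b] += gamma(N[s][a][b]) * (reward + cont - lam - Q[s][a][b])
    return new_lam


Q = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
N = [[[0, 0], [0, 0]] for _ in range(2)]
gamma = lambda n: 1.0 / n
slow = lambda n: 1.0 / (n * math.log(n + 2))    # b(n) = o(gamma(n))
lam = lambda_sspg_step(Q, 0.95, N, n=1, s=1, a=0, b=1, reward=2.0,
                       s_next=0, s_ref=0, gamma=gamma, slow=slow, K=1.0)
```

In the example the unprojected estimate would exceed $K = 1$, so the projection clips it to the boundary.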

5 The Often Update Requirement 

The convergence of both algorithms described in the previous sections required 
several assumptions. Assumption 1 is a property of the (unknown) game. 
Assumption 2 is controlled by P1's choice of the learning rate and can be easily 
satisfied. Assumption 3 (and 4 for the second algorithm) presents an additional 
difficulty. The often updates requirement restricts not only P1's policy but 
also P2's actual play. Obviously, P1 cannot force P2 to perform certain 
actions and consequently we cannot guarantee that $Q_n \to Q^*$. In this section 
we consider methods to relax the often updates assumption. We will suggest 
a modification of the relative Q-learning algorithm to accommodate 
state-action-action triplets that are not played comparatively often. 

If certain state-action-action triplets are performed finitely often their 
Q-values cannot be learned (since even the estimation of the immediate reward is 
not consistent). We therefore must restrict the attention of the learning algorithm 
to Q-values of triplets that are played infinitely often, and make sure that the 
triplets that are not played often do not interfere with the estimation of the 
Q-values of the other triplets. The main problem is that we do not know (at 
any given time) if an action will be chosen finitely often (and can be ignored) or 
comparatively often (and should be used in the Q-learning). We therefore suggest 
maintaining a set of triplets that have been played often enough, and essentially 
learning only on this set. Let $Y_n(\delta)$ denote the set of triplets that 
were sampled at least a $\delta$ fraction of the time up to time $n$, that is: 
$Y_n(\delta) = \{ (s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{B} : N(n,s,a,b)/n \ge \delta \}$. 
The algorithm we suggest is the following modification of (3.4): 

$$Q_{n+1}(s,a,b) = \begin{cases} Q_n(s,a,b) + 1_{\{s_n=s,\,a_n=a,\,b_n=b\}}\, \gamma(N(n,s,a,b)) \big( r_n + FQ_n(s_{n+1}) - f(Q_n) - Q_n(s,a,b) \big) & \text{if } (s,a,b) \in Y_n(\delta), \\ -M & \text{if } (s,a,b) \notin Y_n(\delta), \end{cases} \qquad (5.6)$$

where $M$ is a large positive number which is larger than 
$\max_{s,a,b} |Q(s,a,b)|$. Let 

$$Y_\infty(\delta) = \Big\{ (s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{B} : \liminf_{n \to \infty} \frac{N(n,s,a,b)}{n} \ge \delta \Big\}$$

denote the set of triplets that are chosen comparatively often ($\delta$ is a 
deterministic constant). We refer to the game which is restricted to triplets in 
$Y_\infty(\delta)$ as the $\delta$-observed game. We denote the solution of 
Bellman's equation (2.3) where the $a, b$ entry for all the triplets not in 
$Y_\infty(\delta)$ is replaced by $-M$ (and are therefore not relevant to the 
optimal policy) by $H^*_{Y_\infty}$ and the matching Q-vector by $Q^*_{Y_\infty}$. 
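The bookkeeping for the set of sufficiently sampled triplets is simple to implement. The following sketch covers only that part of the modified update (the first branch is the ordinary relative Q-learning step); the function names and the visit counts are made up for illustration.

```python
def observed_set(N, n, delta):
    """Triplets (s, a, b) whose empirical frequency N/n is at least delta."""
    return {(s, a, b)
            for s, by_a in enumerate(N)
            for a, by_b in enumerate(by_a)
            for b, count in enumerate(by_b)
            if count / n >= delta}


def mask_unobserved(Q, N, n, delta, M):
    """Overwrite Q with -M outside the observed set; return the observed set."""
    keep = observed_set(N, n, delta)
    for s, by_a in enumerate(Q):
        for a, by_b in enumerate(by_a):
            for b in range(len(by_b)):
                if (s, a, b) not in keep:
                    Q[s][a][b] = -M
    return keep


# Visit counts after n = 100 steps (hypothetical numbers).
N = [[[40, 0], [30, 6]],
     [[20, 1], [3, 0]]]
Q = [[[0.5, 0.1], [0.2, -0.3]],
     [[1.0, 0.4], [0.0, 0.7]]]
keep = mask_unobserved(Q, N, n=100, delta=0.05, M=10.0)
```

Because the set is recomputed from the counts at every step, a triplet can enter or leave it over time; the theorem that follows gives a condition under which it eventually stops changing.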

Theorem 3. Suppose that Assumptions 1 and 2 hold, and suppose that for every 
state-action-action triplet $(s,a,b)$ we have that: 

$$\liminf_{n \to \infty} \frac{N(n,s,a,b)}{n} > \delta \quad \text{or} \quad \limsup_{n \to \infty} \frac{N(n,s,a,b)}{n} < \delta .$$

Then (5.6) converges with probability one so that 
$Q_n(s,a,b) \to Q^*_{Y_\infty}(s,a,b)$ for every $(s,a,b) \in Y_\infty(\delta)$. 

Proof. For every triplet $(s,a,b)$ in $Y_\infty(\delta)$ there exists a time 
$\tau(s,a,b)$ such that for every $n > \tau(s,a,b)$ the triplet 
$(s,a,b) \in Y_n(\delta)$. By the condition in the theorem, if 
$(s,a,b) \notin Y_\infty(\delta)$ then there exists a time $\tau'(s,a,b)$ such 
that for every $n > \tau'(s,a,b)$ the triplet $(s,a,b) \notin Y_n(\delta)$. Let 
$\tau$ be the time after which $Y_n(\delta)$ is fixed, i.e., 
$\tau = \max\{ \max_{(s,a,b) \in Y_\infty(\delta)} \tau(s,a,b),\ \max_{(s,a,b) \notin Y_\infty(\delta)} \tau'(s,a,b) \}$. 
Suppose now that the learning algorithm begins at time $\tau$. Since $\tau$ is 
finite it is easy to see that Assumptions 1-3 are satisfied restricted to 
$(s,a,b) \in Y_\infty(\delta)$, so that by Theorem 1 the result follows. Note 
that the triplets which are not in $Y_\infty(\delta)$ are updated every epoch 
(after $\tau$) with the value $-M$. □ 

Naturally, some actions may satisfy neither the liminf condition nor the 
limsup condition. A method that controls $\delta$ dynamically, and thereby 
circumvents this problem, is under current study. 

6 Concluding Remarks 

We presented two Q-learning style algorithms for average reward zero-sum SGs. 
Under appropriate recurrence and often updates assumptions the convergence 
to the optimal policy was established. Our results generalize the discounted case 
that was proved in [LS99]. There are several open questions that warrant 
further study. First, the extension of the results presented in this paper to games 
with a large state space, where function approximation is needed, appears non- 
trivial. Second, we only partially addressed the issue of actions that are not 
chosen comparatively often by the Q-learning algorithm. There are several other 
possibilities that can be considered (using a promotion function as in [EDM01], 
adding a bias factor as in [LR85], and optimistic initial conditions as in [BT02]), 
but none has proved a panacea for the complications introduced by "uneven" 
exploration. Third, we only considered zero-sum games. Extending the algorithms 
presented here to general sum games appears difficult (even the extension for 
discounted reward is a daunting task). Finally, universal consistency in SGs (e.g., 
[MS03]) is a related challenging problem. In this setup P1 tries to attain an 
average reward which is as high as the average reward that could have been attained 
had P2's strategy (or some statistical measure thereof) been provided in advance. 
The definitions for universal consistency in SGs are involved and the strategies 
suggested to date are highly complex. Devising a simple algorithm in the style of 
Q-learning is of great interest. We note, however, that the distinctive property 
of universal consistency is that P2’s strategy cannot be assumed stationary, so 
stochastic approximation algorithms which rely on stationarity may not work. 

A Appendix 

In this appendix we provide convergence proofs of the two learning algorithms 
presented above. We start with the Relative Q-learning algorithm and then 
turn to the λ-SSPG Q-learning algorithm. In both cases we also discuss the 
synchronous algorithm where it is assumed that all the state-action-action triplets 
are sampled simultaneously in every iteration. Much of the derivation here relies 
on [ABB01] and [BM00]. 

A.1 Proof of Theorem 1 

We start by defining a synchronous version of (3.4): 

$$Q_{n+1}(s,a,b) = Q_n(s,a,b) + \gamma(n) \big( r_\xi(s,a,b) + FQ_n(\xi_n(s,a,b)) - f(Q_n) - Q_n(s,a,b) \big), \qquad (A.7)$$

where $\xi_n(s,a,b)$ and $r_\xi(s,a,b)$ are the independently simulated random 
values of the next state and the immediate reward, respectively, assuming 
$s_n = s$, $a_n = a$, and $b_n = b$. The above algorithm is the off-policy 
version of relative value iteration for average reward games. 

Let us refer to Equation (A.7). In order to use the ODE method of [BM00] 
we first reformulate the synchronous Relative Q-learning iteration as a vector 
iterative equation: 

$$Q_{n+1} = Q_n + \gamma(n) \big( TQ_n - f(Q_n)e - Q_n + M_{n+1} \big),$$

where: 1. $T$ is the operator 
$T : \mathbb{R}^{S \cdot A \cdot B} \to \mathbb{R}^{S \cdot A \cdot B}$ that is 
defined by 
$TQ(s,a,b) = \sum_{s' \in \mathcal{S}} P(s'|s,a,b) \big( r(s,a,b) + FQ(s') \big)$; 
2. $f(Q)$ is a relative function as defined previously; and 3. $M_{n+1}$ is the 
"random" part of the iteration: 
$M_{n+1}(s,a,b) = r_\xi(s,a,b) + FQ_n(\xi_n(s,a,b)) - TQ_n(s,a,b)$. Denoting the 
$\sigma$-algebra until time $n$ by 
$\mathcal{F}_n = \sigma(Q_m, M_m, m \le n)$ it follows that for all $n$, under 
the assumption that all random variables are bounded, 
$E(M_{n+1} | \mathcal{F}_n) = 0$ and 
$E(\|M_{n+1}\|^2 | \mathcal{F}_n) \le C(1 + \|Q_n\|^2)$ for some constant $C$. 
We follow the analysis made by [ABB01] for the rest of the section. 

Let us define the following operators 
$T' : \mathbb{R}^{S \cdot A \cdot B} \to \mathbb{R}^{S \cdot A \cdot B}$ and 
$\hat{T} : \mathbb{R}^{S \cdot A \cdot B} \to \mathbb{R}^{S \cdot A \cdot B}$: 
$T'(Q) = TQ - f(Q)e$ and $\hat{T}(Q) = TQ - ve$, where $v$ is 
the value of the game. In order to apply the ODE method we need to prove that 
the following ODE is asymptotically stable: 

$$\dot{Q}(t) = T'(Q(t)) - Q(t) . \qquad (A.8)$$

The operator $T'$ is not a contraction; furthermore, it is not even non-expansive. 
We therefore establish its stability directly by considering the following ODE: 

$$\dot{Q}(t) = \hat{T}(Q(t)) - Q(t) . \qquad (A.9)$$

The following lemmas establish the properties of the operators. 

Lemma 1. The operator $T$ is sup norm non-expansive. 

Proof. Recall that 
$TQ(s,a,b) = \sum_{s' \in \mathcal{S}} P(s'|s,a,b) \big( r(s,a,b) + FQ(s') \big)$. 
Fix $Q_1$ and $Q_2$; the sup norm of the difference, 
$\|T(Q_1) - T(Q_2)\|_\infty$, is achieved by some element $(s,a,b)$: 

$$\|T(Q_1) - T(Q_2)\|_\infty = \Big| \sum_{s' \in \mathcal{S}} P(s'|s,a,b) \Big( \max_{\nu \in \Delta(\mathcal{A})} \min_{\mu \in \Delta(\mathcal{B})} \sum_{a',b'} Q_1(s',a',b') \nu_{a'} \mu_{b'} - \max_{\nu \in \Delta(\mathcal{A})} \min_{\mu \in \Delta(\mathcal{B})} \sum_{a',b'} Q_2(s',a',b') \nu_{a'} \mu_{b'} \Big) \Big| .$$

Assume without loss of generality that the sum inside the absolute value is 
positive. For every $s'$ fix $\nu(s')$, a max-min strategy of P1 for the first 
element (the element that relates to $Q_1$). Similarly, fix $\mu(s')$, a min-max 
strategy of P2 for the game defined by the second element, for every $s'$. Using 
$\nu(s')$ and $\mu(s')$ in both elements, by the min-max theorem the first 
element cannot decrease and the second cannot increase. Since for every $s'$ the 
difference may only increase we have that: 

$$\|T(Q_1) - T(Q_2)\|_\infty \le \Big| \sum_{s' \in \mathcal{S}} P(s'|s,a,b) \sum_{a',b'} \big( Q_1(s',a',b') - Q_2(s',a',b') \big) \nu_{a'}(s') \mu_{b'}(s') \Big| .$$

But this is a convex combination of elements of $Q_1 - Q_2$ and is certainly not 
more than the sup norm of the difference. □ 

Corollary 1. $\hat{T}$ is sup norm non-expansive. 

Proof. $\|\hat{T}(Q_1) - \hat{T}(Q_2)\|_\infty = \|T(Q_1) - T(Q_2)\|_\infty \le \|Q_1 - Q_2\|_\infty$. □ 

Let us denote the span semi-norm by $\|\cdot\|_s$. That is, 
$\|Q\|_s = \max_{s,a,b} Q(s,a,b) - \min_{s,a,b} Q(s,a,b)$. 

Lemma 2. The operator $T$ is span semi-norm non-expansive. 

Proof. By definition, 

$$\|TQ_1 - TQ_2\|_s = \max_{s,a,b} \{ TQ_1(s,a,b) - TQ_2(s,a,b) \} - \min_{s',a',b'} \{ TQ_1(s',a',b') - TQ_2(s',a',b') \} .$$

There exist $(\bar{s},\bar{a},\bar{b})$ and 
$(\underline{s},\underline{a},\underline{b})$ that achieve the maximum and 
minimum of the span semi-norm, respectively. By writing the operator $T$ 
explicitly and cancelling the reward elements: 

$$\begin{aligned} \|TQ_1 - TQ_2\|_s = {} & \sum_{s'} P(s'|\bar{s},\bar{a},\bar{b}) \Big( \max_{\nu \in \Delta(\mathcal{A})} \min_{\mu \in \Delta(\mathcal{B})} \sum_{a',b'} Q_1(s',a',b') \nu_{a'} \mu_{b'} - \max_{\nu \in \Delta(\mathcal{A})} \min_{\mu \in \Delta(\mathcal{B})} \sum_{a',b'} Q_2(s',a',b') \nu_{a'} \mu_{b'} \Big) \\ & - \sum_{s'} P(s'|\underline{s},\underline{a},\underline{b}) \Big( \max_{\nu \in \Delta(\mathcal{A})} \min_{\mu \in \Delta(\mathcal{B})} \sum_{a',b'} Q_1(s',a',b') \nu_{a'} \mu_{b'} - \max_{\nu \in \Delta(\mathcal{A})} \min_{\mu \in \Delta(\mathcal{B})} \sum_{a',b'} Q_2(s',a',b') \nu_{a'} \mu_{b'} \Big) . \end{aligned}$$

For every $s'$ there are four min-max operations in the above. Let us denote 
the maximizing strategy of P1 in the $i$-th item for state $s'$ by $\nu^i(s')$ 
and the minimizing strategy of P2 in the $i$-th item for state $s'$ by 
$\mu^i(s')$. For every $s'$ fix $\nu^1(s')$ as P1's strategy and $\mu^2(s')$ as 
P2's strategy for the first two elements. The difference of the first two 
elements can only increase, as the first element cannot decrease and the second 
cannot increase. Similarly, for every $s'$ fix, for the third and fourth 
elements, P1's strategy to be $\nu^4(s')$ and P2's strategy to be $\mu^3(s')$. 
The third element cannot increase and the fourth cannot decrease, so the 
subtracted difference between them can only decrease, and thus the total 
increases. We therefore obtain that $\|TQ_1 - TQ_2\|_s$ is bounded by the 
difference of two convex combinations of the elements of $Q_1 - Q_2$, which is 
certainly not greater than the span semi-norm of $Q_1 - Q_2$. □ 

Corollary 2. $T'$ and $\hat{T}$ are span semi-norm non-expansive. 

Denote the set of equilibrium points of the ODE (A.9) by $G$, that is 
$G = \{ Q : TQ = Q + ve \}$. 

Lemma 3. $G$ is of the form $Q^* + ce$. 

Proof. First note that for every $c \in \mathbb{R}$ and 
$Q \in \mathbb{R}^{S \cdot A \cdot B}$ we have $T(Q + ce) = TQ + ce$ ($e$ is now 
an $S \cdot A \cdot B$ dimensional vector of ones). Also note that 
$F(Q + ce) = FQ + ce$ as an equality in $\mathbb{R}^{S}$. Apply the operator $F$ 
to the equation $TQ = Q + ve$, so that for $Q \in G$ we have that 
$FTQ = FQ + ve$. Under Assumption 1 we can apply Proposition 5.1 from [Pat97]. 
According to this proposition there exists a 
unique solution to the equation $TH = H + ve$ up to an additive constant. □ 
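The step of applying $F$ to an equation in $Q$ relies on an identity that the proof leaves implicit. A short derivation, using only the definitions of $T$ on Q-vectors and of $F$, and writing $H = FQ$ (a shorthand introduced here for illustration):

$$(TQ)(s,a,b) = r(s,a,b) + \sum_{s' \in \mathcal{S}} P(s'|s,a,b)\, H(s'),$$

$$F(TQ)(s) = \mathrm{val}_{a,b}\Big[ r(s,a,b) + \sum_{s' \in \mathcal{S}} P(s'|s,a,b)\, H(s') \Big] = (TH)(s),$$

where the last $T$ is the operator (2.3) acting on $\mathbb{R}^{S}$. Likewise $F(Q + ce) = FQ + ce$, because adding a constant to every entry of a matrix game shifts its value by that constant. An equation of the form $TQ = Q + ce$ in $Q$ therefore becomes $TH = H + ce$ in $H$, which is the form of the optimality equation (2.2).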

Theorem 4. $Q^*$ is the globally asymptotically stable equilibrium point for (A.8). 

Proof. This is proved by direct computation using the above lemmas, and the 
Lipschitz continuity of $f$. We omit the details as the proof follows [ABB01, 
Theorem 3.4]. □ 

We use the formulation of [BM00] for establishing convergence for the 
synchronous and the asynchronous cases. For the synchronous case we only need 
the stability assumption and the standard stochastic approximation assumption 
on the learning rate. 

Theorem 5. Under Assumptions 1 and 2 the synchronous algorithm (A.7) 
converges to $Q^*$ almost surely. 

Proof. We apply Theorem 2.2 from [BM00] to show the boundedness of $Q_n$ and 
to prove convergence to $Q^*$. As in [BM00], let $h(Q) = T(Q) - Q - f(Q)e$. The 
ODE $\dot{Q}(t) = h(Q(t))$ has a globally stable solution by Theorem 4. Since 
$f(\alpha Q) = \alpha f(Q)$ it follows that the limit 
$h_\infty(Q) = \lim_{z \to \infty} h(zQ)/z$ exists and is obtained from $h$ by 
setting the payoffs $r(s,a,b)$ to zero for all $s, a, b$. According to Theorem 4 
the origin is asymptotically stable since the theorem can be applied to the game 
with zero payoffs. The other assumptions of Theorem 2.2 from [BM00] are satisfied 
by construction. □ 

The asynchronous algorithm converges under the appropriate assumptions. 

Proof of Theorem 1: This is a corollary of Theorem 2.5 in [BM00]; the 
conditions are satisfied as proved for the synchronous case. □ 

A critical component in the proof is the boundedness of $Q_n$. We used the 
method of [BM00]; however, one can show it directly as in [ABB01, Theorem 
3.5]. By showing the boundedness directly a somewhat weaker assumption on $f$ 
can be made, namely that $|f(Q)| \le \|Q\|_\infty$ instead of 
$f(\alpha Q) = \alpha f(Q)$. 

A.2 Proof of Theorem 2 

We associate with the average reward game an SSPG parameterized by 
$\lambda \in \mathbb{R}$. This SSPG has a similar state space, reward function, 
and conditional transition probability to the average reward game. The only 
difference is that $s^*$ becomes an absorbing state with zero reward, and the 
reward in all other states is reduced by $\lambda$. Let $V_\lambda$ denote the 
value function of the associated SSPG, which is given as the unique (by 
Proposition 4.1 from [Pat97]) solution of: 

$$V_\lambda(s) = \mathrm{val}_{a,b}\Big[ r(s,a,b) + \sum_{s' \in \mathcal{S}} P(s'|s,a,b) V_\lambda(s') - \lambda \Big], \quad s \ne s^*, \qquad V_\lambda(s^*) = 0 . \qquad (A.10)$$

If $\lambda = v$ we retrieve the Bellman equation (2.2) for average reward. Let 
us first consider the synchronous version of (4.5): 

$$Q_{n+1}(s,a,b) = Q_n(s,a,b) + \gamma(n) \big( r_\xi(s,a,b) + FQ_n(\xi_n(s,a,b))\, 1_{\{\xi_n(s,a,b) \ne s^*\}} - \lambda_n - Q_n(s,a,b) \big),$$

$$\lambda_{n+1} = \lambda_n + b(n)\, FQ_n(s^*) , \qquad (A.11)$$

where we require that $b(n) = o(\gamma(n))$, and $\xi_n$ and $r_\xi$ are as 
before. The problem with using the ODE method directly is that $\lambda_n$ may 
be unbounded. As in [ABB01], this can be solved using the projection method 
(e.g., [KY97]) by replacing the iteration of $\lambda$ by 
$\lambda_{n+1} = \Lambda(\lambda_n + b(n) FQ_n(s^*))$, where $\Lambda(\cdot)$ is 
the projection onto the interval $[-K, K]$ with $K$ chosen so that 
$v \in [-K, K]$. The following relies on a two time scale analysis as suggested 
in [Bor97]. The analysis closely follows Section 4 of [ABB01]. The limiting ODE 
of the iteration (A.11), assuming that $b(n) = o(\gamma(n))$, is: 

$$\dot{Q}(t) = T'(Q(t), \lambda(t)) - Q(t), \qquad \dot{\lambda}(t) = 0,$$

where $T'(Q, \lambda)$ is $T(Q) - \lambda e$. Thus, it suffices to prove that the 
following equation: 

$$\dot{Q}(t) = T'(Q(t), \lambda) - Q(t) \qquad (A.12)$$

is asymptotically stable for a fixed $\lambda$. The stability can be deduced 
from the fact that $T$ is a weighted maximum norm contraction, as the following 
lemma proves. Recall that a weighted maximum norm with weight vector $w$ in 
$\mathbb{R}^{d}$ is defined as $\|x\|_w = \max_{1 \le i \le d} |x_i| / w_i$. A 
policy is called proper in an SSPG if its total reward is finite (almost surely 
for every policy of the opponent). Assumption 1 implies that all policies are 
proper in the associated SSPG. 

Lemma 4. Assume that all the policies are proper in an SSPG. Then the 
operator $T(Q)$ is a weighted maximum norm contraction. 

Proof. We define a stochastic shortest path (SSP) problem where both players 
cooperate in trying to minimize the time of arrival to the absorbing state. 
Using the solution to this problem we bound the difference between Q-vectors 
when the players do not cooperate. Define a new single player SSP ([BT95], 
Section 2.3) where all the rewards are set to $-1$ (except for $s^*$ which is 
zero reward) and the transition probabilities are unchanged. The two players are 
allowed to cooperate. By [BT95], there exists an optimal reward $\tilde{J}$ and 
stationary policies $\tilde{\mu} \in \Delta(\mathcal{A})^{S}$ for P1 and 
$\tilde{\nu} \in \Delta(\mathcal{B})^{S}$ for P2 such that the optimal time of 
arrival to the absorbing state is minimal. The vector $\tilde{Q}$ is defined as 
$\tilde{Q}(s,a,b) = \sum_{s'} P(s'|s,a,b) \tilde{J}(s') - 1$. Bellman's equation 
for that SSP is: 

$$\tilde{Q}(s,a,b) = -1 + \sum_{s'} P(s'|s,a,b) \sum_{a',b'} \tilde{\mu}_{a'}(s') \tilde{\nu}_{b'}(s') \tilde{Q}(s',a',b') .$$

Moreover, for any $\mu$ and $\nu$ we have, in vector notation, 
$\tilde{Q} \le -e + P_{\mu,\nu} \tilde{Q}$, where $P_{\mu,\nu}$ is the 
transition matrix assuming $\mu$ and $\nu$ are played. That is: 

$$-\sum_{s'} P(s'|s,a,b) \sum_{a',b'} \mu_{a'}(s') \nu_{b'}(s') \tilde{Q}(s',a',b') \le -\tilde{Q}(s,a,b) - 1 \le -\tilde{Q}(s,a,b)\, \alpha , \qquad (A.13)$$

where $\alpha = \max_{s,a,b} (\tilde{Q}(s,a,b) + 1) / \tilde{Q}(s,a,b)$; since 
$\tilde{Q} \le -1$ we have $\alpha \in [0, 1)$. We now show that $\alpha$ is the 
contraction factor for the weighted max norm whose weight vector is 
$w = -\tilde{Q}$. 

Resuming the discussion of the original SSPG, let $Q$ and $\bar{Q}$ be elements 
such that $\|Q - \bar{Q}\|_w = c$. Let $\mu \in \Delta(\mathcal{A})^{S}$ be a 
policy such that $T_\mu Q = TQ$ (maximizing policy), where: 

$$T_\mu Q(s,a,b) = \sum_{s'} P(s'|s,a,b) \Big( r(s,a,b) + \min_{\nu \in \Delta(\mathcal{B})} \sum_{a',b'} \mu_{a'}(s') \nu_{b'} Q(s',a',b') \Big) .$$

Let $\nu \in \Delta(\mathcal{B})^{S}$ be a policy for P2 such that 
$T_\mu \bar{Q} = T_{\mu\nu}(\bar{Q})$ (minimizing policy for P2), where 
$T_{\mu\nu} Q(s,a,b) = \sum_{s'} P(s'|s,a,b) \big( r(s,a,b) + \sum_{a',b'} \mu_{a'}(s') \nu_{b'}(s') Q(s',a',b') \big)$. 
It follows then that 
$TQ - T\bar{Q} = T_\mu Q - T\bar{Q} \le T_\mu Q - T_\mu \bar{Q} = T_\mu Q - T_{\mu\nu} \bar{Q} \le T_{\mu\nu} Q - T_{\mu\nu} \bar{Q}$. 
The inequalities follow by imposing on the minimizer and the maximizer policies 
that might be suboptimal. 
We therefore have that for every $s, a, b$ 



$$\frac{TQ(s,a,b) - T\bar{Q}(s,a,b)}{c\, w(s,a,b)} \le \frac{1}{c\, w(s,a,b)} \sum_{s'} \sum_{a',b'} P(s'|s,a,b)\, \mu_{a'}(s') \nu_{b'}(s') \big( Q(s',a',b') - \bar{Q}(s',a',b') \big) ,$$







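since $\|Q - \bar{Q}\|_w = c$ we have $Q - \bar{Q} \le cw$ (as a vector 
inequality) and therefore: 

$$\frac{TQ(s,a,b) - T\bar{Q}(s,a,b)}{c\, w(s,a,b)} \le \frac{1}{w(s,a,b)} \sum_{s'} \sum_{a',b'} P(s'|s,a,b)\, \mu_{a'}(s') \nu_{b'}(s')\, w(s',a',b') .$$

Plugging in $w$ as defined above: 

$$\frac{TQ(s,a,b) - T\bar{Q}(s,a,b)}{c\, w(s,a,b)} \le \frac{1}{-\tilde{Q}(s,a,b)} \sum_{s'} \sum_{a',b'} P(s'|s,a,b)\, \mu_{a'}(s') \nu_{b'}(s') \big( -\tilde{Q}(s',a',b') \big) .$$

Finally, using the previous argument regarding the minimality of $\tilde{\mu}$ 
and $\tilde{\nu}$ and (A.13) we have 

$$\frac{TQ(s,a,b) - T\bar{Q}(s,a,b)}{c\, w(s,a,b)} \le \frac{1}{-\tilde{Q}(s,a,b)} \big( -\tilde{Q}(s,a,b) \big) \alpha = \alpha < 1 .$$

□ 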
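The weighted-norm construction in this proof can be illustrated numerically in a simplified setting. The sketch below assumes a single fixed pair of stationary policies, so the weights are just expected hitting times of the absorbing state, whereas the proof takes an extremal case over all policy pairs. The function names and the transition matrix are illustrative assumptions.

```python
def hitting_time_weights(P_sub, sweeps=2000):
    """Solve w = 1 + P_sub w by fixed-point iteration.

    P_sub holds transition probabilities among non-absorbing states only
    (rows sum to less than one), so w is the expected time to absorption.
    """
    n = len(P_sub)
    w = [1.0] * n
    for _ in range(sweeps):
        w = [1.0 + sum(P_sub[i][j] * w[j] for j in range(n)) for i in range(n)]
    return w


def weighted_max_norm(x, w):
    """||x||_w = max_i |x_i| / w_i."""
    return max(abs(xi) / wi for xi, wi in zip(x, w))


def apply(P_sub, x):
    """Matrix-vector product P_sub x."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P_sub]


# Each state moves to the absorbing state with probability 0.25.
P_sub = [[0.5, 0.25], [0.25, 0.5]]
w = hitting_time_weights(P_sub)                  # both weights equal 4
alpha = max((wi - 1.0) / wi for wi in w)         # contraction factor 0.75

x = [2.0, -3.0]
lhs = weighted_max_norm(apply(P_sub, x), w)      # norm after one transition step
rhs = alpha * weighted_max_norm(x, w)            # alpha times the original norm
```

The inequality `lhs <= rhs` is the linear analogue of the bound obtained at the end of the proof.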

Let $Q^*(\lambda)$ be the Q-vector that appears in each entry of (A.10). Adapting 
the arguments of [BS99] and using the fact that $T'(\cdot, \lambda)$ is a 
weighted maximum norm contraction we can deduce that: 

Lemma 5. The globally asymptotically stable equilibrium for (A.12) is 
$Q^*(\lambda)$. Furthermore, every solution of the ODE (A.12) satisfies 
$\|Q(t) - Q^*(\lambda)\|_w \to 0$ monotonically. 

In order to use the two time scale stochastic approximation convergence theorem 
(e.g., [Bor97]) we need to establish the boundedness of $Q$: 

Lemma 6. $Q_n$ remains bounded almost surely for both the synchronous (A.11) 
and asynchronous (4.5) iterations. 

Proof. According to Lemma 4 there exists some $D$ such that 
$\|T(Q)\|_w \le \alpha \|Q\|_w + D$. Since $\lambda$ is bounded by $K$ we have 
$\|T'(Q, \lambda)\|_w \le \alpha \|Q\|_w + D + K$. If 
$\|Q\|_w \ge \frac{2}{1-\alpha}(D + K)$ we have 
$\|T'(Q, \lambda)\|_w \le \alpha \|Q\|_w + D + K \le (1/2 + \alpha/2) \|Q\|_w$, 
and therefore for $Q$ whose norm is large enough the iteration reduces the norm. 
The asynchronous case follows in a similar manner to [BT95, Section 2.3]. □ 

A convergence theorem can finally be proved in a similar manner to [ABB01]. 

Theorem 6. Suppose that Assumptions 1 and 2 hold. Then the synchronous 
λ-SSPG Q-learning algorithm (A.11) satisfies $(Q_n, \lambda_n) \to (Q^*, v)$ 
almost surely. 

Proof. The assumptions needed for Theorem 1.1 in [Bor97] are satisfied by 
construction. By definition $\lambda_n$ is bounded. The vector $Q_n$ is bounded 
by Lemma 6. Since $T'$ is continuous w.r.t. $\lambda$ and using the stability of 
the underlying ODE (Lemma 5) we have ensured convergence to the appropriate 
limit. The only difference from Theorem 4.5 in [ABB01] is that we need to make 
sure that the slope of the mapping $\lambda \to Q^*(\lambda)$ is finite. But this 
was shown by Lemma 5.1 of [Pat97]. □ 

For the asynchronous case the same can be proved. 

Proof of Theorem 2: The analysis of [Bor98,KB00] applies since boundedness 
holds by Lemma 6. The only difference from Theorem 6 is that a time scaled 
version is used. □ 

Abstract. We give the first polynomial time prediction strategy for any 
PAC-learnable class $C$ that probabilistically predicts the target with mistake 
probability 

$$\frac{\mathrm{poly}(\log(t))}{t} = \tilde{O}\Big(\frac{1}{t}\Big),$$

where $t$ is the number of trials. The lower bound for the mistake probability 
is [HLW94] $\Omega(1/t)$, so our algorithm is almost optimal.^1 



1 Introduction 

In the Probabilistic Prediction model [HLW94] a teacher chooses a boolean
function $f : X \to \{0, 1\}$ from some class of functions $C$ and a distribution $D$ on $X$.
At trial $t$ the learner receives from the teacher a point $x_t$ chosen from $X$
according to the distribution $D$ and is asked to predict $f(x_t)$. The learner uses some
prediction strategy $S$ (algorithm), predicts $S(x_t)$ and sends it to the teacher. The
teacher then answers “correct” if the prediction is correct, i.e. if $S(x_t) = f(x_t)$,
and answers “mistake” otherwise. The goal of the learner is to run in polynomial
time at each trial (polynomial in $\log t$ and some measures of the class and the
target) and to minimize the worst case (over all $f \in C$ and $D$) probability of mistake
in predicting $f(x_t)$.
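
The protocol described above can be sketched as a small simulation. This is only an illustration of the teacher/learner interaction, not the strategy developed in this paper: the names `run_prediction_game` and `memorizing_strategy`, the finite domain and the threshold target are all assumptions introduced for the example.

```python
import random

def run_prediction_game(target, sample_point, strategy, trials, seed=0):
    """Simulate the probabilistic prediction protocol.

    Returns one boolean per trial, True when the learner's prediction
    differed from the target value on that trial.
    """
    rng = random.Random(seed)
    history = []   # (x, f(x)) pairs from earlier trials
    mistakes = []
    for _ in range(trials):
        x = sample_point(rng)            # teacher draws x_t according to D
        guess = strategy(history, x)     # learner predicts S(x_t)
        truth = target(x)
        mistakes.append(guess != truth)  # teacher answers "mistake" or "correct"
        history.append((x, truth))       # for boolean f the answer reveals f(x_t)
    return mistakes

def memorizing_strategy(history, x):
    """Hypothetical baseline: repeat a label already seen for x, else predict False."""
    for seen_x, label in history:
        if seen_x == x:
            return label
    return False

# Illustrative setting (assumed): threshold target on {0, ..., 9}, uniform D.
mistakes = run_prediction_game(
    target=lambda x: x >= 5,
    sample_point=lambda rng: rng.randrange(10),
    strategy=memorizing_strategy,
    trials=200,
)
```

On a finite domain the memorizing baseline errs at most once per point, which is why its mistake count stays bounded; the interest of the paper lies in classes where no such trivial strategy is available.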

Haussler et al. [HLW94] gave a double exponential time prediction strategy
(exponential in the number of trials $t$) that achieves mistake probability
$V_C/t = O(1/t)$, where $V_C$ is the VC-dimension of the class $C$. They also showed
a lower bound of $\Omega(V_C/t)$ for the mistake probability. They then gave an
exponential time algorithm (polynomial in $t$) that achieves mistake probability
$(V_C/t)\log(t/V_C) = O(\log t/t)$ assuming that $C$ is PAC-learnable in polynomial
time. Since learning in the probabilistic model implies learning in the PAC model,
the requirement that $C$ is efficiently PAC-learnable is necessary for efficient
probabilistic prediction. The results from [BG02] give a randomized strategy that
achieves mistake probability exponentially small in the number of mistakes.

* This research was supported by the fund for promotion of research at the Technion. 
Research no. 120-025. 

^1 The lower bound proved in [HLW94] is $\Omega(V_C/t)$ where $V_C$ is the VC-dimension of
the class $C$. In our case $V_C$ is fixed and therefore is $O(1)$ with respect to $t$.



J. Shawe-Taylor and Y. Singer (Eds.): COLT 2004, LNAI 3120, pp. 64—76, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 
In this paper we give an algorithm that generates a deterministic prediction
strategy $S$. We show that if $C$ is PAC-learnable then there is a deterministic
prediction strategy that runs in polynomial time and achieves mistake probability
at most

$$\frac{poly(\log t)}{t} = \tilde{O}\left(\frac{1}{t}\right).$$

This is the first prediction strategy that runs in polynomial time and achieves
an almost optimal mistake probability.

Our algorithm is based on building a new booster for the PAExact model
[BG02]. The booster is randomized but the hypotheses it produces (that are used
for the predictions) are deterministic. We believe that the same technique used
in this paper (section 4) may also be used for the booster in [BG02] to achieve
the same result (with much greater time complexity and a randomized prediction
strategy).

The first part of the paper gives a PAExact-learning algorithm that uses
deterministic hypotheses for any PAC-learnable class and that achieves exponentially
small error in the number of equivalence queries. In the second part we show
how to turn this algorithm into a deterministic prediction strategy that achieves
the required mistake probability.

In section 2 and 3 we build a new deterministic booster for the PAExact- 
model and then in section 4 we show how to change the PAExact-learning algo- 
rithm to a prediction strategy that achieves the above bound. 

2 Learning Models and Definitions 

Let $C$ be a class of functions $f : X \to \{0, 1\}$. The domain $X$ can be finite,
countably infinite, or $\mathbb{R}^n$ for some $n \ge 1$. In learning, a teacher has a target
function $f \in C$ and a probability distribution $D$ on $X$. The learner knows $C$ but
does not know the probability distribution $D$ nor the function $f$.

The problem size $I_f$ that we will use in this paper depends on $X$, $C$ and
$f$ and it can be different in different settings. The term “polynomial” means
polynomial in the problem size $I_f$. For example, for Boolean functions with
$X = \{0, 1\}^n$, $C$ is a set of formulas (e.g. DNF, decision trees, etc.). The problem
size is $I_f = n + size_C(f)$ where $size_C(f)$ is the minimal size of a formula in $C$ that
is equivalent to $f$. Then “polynomial” means $poly(I_f) = poly(n, size_C(f))$. For
infinite domains $X$ the parameter $n$ is usually replaced by the VC-dimension
$V_C$ of the class and $I_f = V_C + size_C(f)$. Then “polynomial” in this case is
$poly(V_C, size_C(f))$.

The learner can ask the teacher queries about the target. The teacher can be
regarded as an adversary with unlimited computational power that must answer
honestly but also wants to prevent the learner from learning quickly. The queries we
consider in this paper are:

Example Query according to $D$ ($Ex_D$) [V84] For the example query the
teacher chooses $x \in X$ according to the probability distribution $D$ and returns
$(x, f(x))$ to the learner.

We say that the hypothesis $h_r$ $\varepsilon$-approximates $f$ with respect to distribution
$D$ if $\Pr_{D,r}[f(x) \ne h_r(x)] \le \varepsilon$.

Equivalence Query according to $D$ ($EQ_D$) [B97] For the equivalence query
according to distribution $D$ the learner asks $EQ_D(h)$ for some polynomial size
circuit^2 $h$. The teacher chooses $y \in X_{f \Delta h}$, where $X_{f \Delta h} = \{x \in X : f(x) \ne h(x)\}$,
according to the induced distribution of $D$ on $X_{f \Delta h}$ and returns $(y, f(y))$.
If $\Pr_D[X_{f \Delta h}] = 0$, the teacher answers “YES”. Equivalence queries with
randomized hypotheses are defined in [BG02].
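
The two query types can be made concrete on a finite domain. The sketch below is an assumption-laden illustration (finite domain given as a list, $D$ given as a list of weights, hypothetical function names), not an interface taken from the paper.

```python
import random

def example_oracle(f, domain, weights, rng):
    """Ex_D: draw x according to D and return the labelled example (x, f(x))."""
    x = rng.choices(domain, weights=weights, k=1)[0]
    return x, f(x)

def equivalence_oracle(f, h, domain, weights, rng):
    """EQ_D(h): return a counterexample (y, f(y)) drawn from D restricted to
    the disagreement set {x : f(x) != h(x)}, or "YES" if that set has
    probability zero."""
    disagree = [(x, w) for x, w in zip(domain, weights)
                if w > 0 and f(x) != h(x)]
    if not disagree:
        return "YES"
    points, point_weights = zip(*disagree)
    y = rng.choices(points, weights=point_weights, k=1)[0]
    return y, f(y)
```

Restricting $D$ to the disagreement set and renormalising is what "the induced distribution" amounts to in this finite setting; `random.choices` performs the renormalisation implicitly.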

The learning models we will consider in this paper are:

PAC (Probably Approximately Correct) [V84] In the PAC learning model we say
that an algorithm $A$ of the learner PAC-learns the class $C$ if for any $f \in C$, any
probability distribution $D$ and any $\varepsilon, \delta > 0$ the algorithm $A(\varepsilon, \delta)$ asks example
queries according to $D$, $Ex_D$, and with probability at least $1 - \delta$ outputs a
polynomial size circuit $h$ that $\varepsilon$-approximates $f$ with respect to $D$. That is,
$\Pr_D[X_{f \Delta h}] \le \varepsilon$. We say that $C$ is PAC-learnable if there is an algorithm that
PAC-learns $C$ in time $poly(1/\varepsilon, \log(1/\delta), I_f)$.

PAExact (Probably Almost Exactly Correct) [BJT02] In the PAExact learning
model we say that an algorithm $A$ of the learner PAExact-learns the class $C$ if
for any $f \in C$, any probability distribution $D$ and any $\eta, \delta > 0$ the algorithm
$A(\eta, \delta)$ asks equivalence queries according to $D$, $EQ_D$, and with probability at
least $1 - \delta$ outputs a polynomial size circuit $h$ that $\eta$-approximates $f$ with respect
to $D$. That is, $\Pr_D[X_{f \Delta h}] \le \eta$. We say that $C$ is PAExact-learnable if there is an
algorithm that PAExact-learns $C$ in time $poly(\log(1/\eta), \log(1/\delta), I_f)$.

In the online learning model [L88] the teacher at each trial sends a point
$x \in X$ to the learner and the learner has to predict $f(x)$. The learner returns
to the teacher the prediction $y$. If $f(x) \ne y$ then the teacher returns “mistake”
to the learner. The goal of the learner is to minimize the number of prediction
mistakes.

Online [L88] In the online model we say that algorithm $A$ of the learner
Online-learns the class $C$ if for any $f \in C$ and for any $\delta$, algorithm $A(\delta)$ with probability
at least $1 - \delta$ makes a bounded number of mistakes. We say that $C$ is
Online-learnable if the number of mistakes and the running time of the learner for each
prediction is $poly(\log(1/\delta), I_f)$.

Probabilistic Prediction (PP) [HLW94] In the Probabilistic Prediction model
the points sent to the learner $x_1, x_2, \ldots$ are chosen from $X$ according to some
distribution $D$. The goal of the prediction strategy at trial $t$ is to predict $f(x_t)$ with
minimal mistake probability. We say that $C$ is $\varepsilon$-PP-learnable if the prediction
strategy runs in time $poly(I_f, \log t)$ and achieves mistake probability $\varepsilon$.



^2 For infinite domains $X$, the definition of “circuit” depends on the setting in which
the elements of $C$ are represented. The hypothesis $h$ must have polynomial size in
this setting. E.g., if $X = \mathbb{R}^n$ we may ask of $h$ to be a polynomial size arithmetic
circuit.

3 The New Algorithm

In this section we give our new booster for the PAExact learning model and 
prove its correctness. In Subsection 3.1 we show how to start from a hypothesis 
that approximates the target function / and refine it to get a better one. In 
Subsection 3.2 we give the main algorithm and prove its correctness. 

3.1 Refining the Hypothesis 

We will first give a booster for the PAExact-learning model that takes a
hypothesis that $\eta$-approximates the target and builds a new hypothesis that
$\eta/2$-approximates the target.

Let $A$ be a PAC-learning algorithm that learns the class $C$ in polynomial
time from $m_A(\varepsilon, \delta, I_f)$ examples. Let $h_0$ be a hypothesis such that

$$\Pr_D[f \ne h_0] \le \eta. \qquad (1)$$

Our booster learns a sequence of hypotheses $\mathcal{H} = h_1, h_2, h_3, \ldots, h_k$ and then
uses this sequence to build the refined hypothesis.

We start with the following notation. Let

$$H_j^{\wedge} = \begin{cases} 1 & j = 1 \\ h_1 \wedge \cdots \wedge h_{j-1} & j > 1 \end{cases} \qquad \text{and} \qquad H_j^{\vee} = \begin{cases} 0 & j = 1 \\ h_1 \vee \cdots \vee h_{j-1} & j > 1. \end{cases}$$

Let $H_j = \bar{h}_0 H_j^{\wedge} \vee h_0 H_j^{\vee}$ and $G_j = \bar{h}_0 H_j^{\wedge} \vee h_0 \overline{H_j^{\vee}}$.

Now we show how the booster learns $h_j$ from $h_1, h_2, \ldots, h_{j-1}$. The booster
runs the procedure Learnh$(j, \varepsilon, \delta)$; see Figure 1. This procedure either returns
a refined hypothesis $h$ (see steps 10 and 11 in Learnh) or returns the next
hypothesis $h_j$ in the sequence $\mathcal{H}$ (see step 14 in Learnh). In the former case
$h_j = $ NULL, indicating that $h_{j-1}$ is the last function in the sequence $\mathcal{H}$, and then
$\mathcal{H} = h_1, h_2, \ldots, h_k$ for $k = j - 1$. In the latter case a new function $h_j$ is generated
in $\mathcal{H}$. We will show that for some $\varepsilon = 1/poly(\log(1/\eta))$ and $k = poly(\log(1/\eta))$,
either $h_0$ or $H_k$ or $H_{k+1}$ (depending on where the algorithm returns in the last
call to Learnh: in step 10, 11 or 14, respectively) is an $\eta/2$-approximation of
$f$. For the analysis of the algorithm we define three values: for $j \le k$,

$$w_j = \Pr_D[\bar{h}_0 H_j^{\wedge} \bar{h}_j = 1, f = 1] + \Pr_D[h_0 \overline{H_j^{\vee}} h_j = 1, f = 0]$$

$$u_j = \Pr_D[\bar{h}_0 H_{j+1}^{\wedge} = 1, f = 0] + \Pr_D[h_0 \overline{H_{j+1}^{\vee}} = 1, f = 1]$$

$$v_j = \Pr_D[\bar{h}_0 \overline{H_{j+1}^{\wedge}} = 1, f = 1] + \Pr_D[h_0 H_{j+1}^{\vee} = 1, f = 0]$$

We prove the following

Property 1. We have (1) $u_0 \le 1$. (2) $v_j = \sum_{i=1}^{j} w_i$. (3) $\Pr_D[f \ne H_{j+1}] = u_j + v_j$.
Learnh$(j, \varepsilon, \delta)$

1) Set $m \leftarrow m_A(\varepsilon/2, \delta, I_f)$; $r_0 \leftarrow 0$; $r_j \leftarrow 0$.
2) For $i \leftarrow 1$ to $m$
3)     Flip a fair coin $\to$ result
4)     If result = “Head” Then
5)         Repeat $EQ_D(h_0) \to (x_i, f(x_i))$; $r_0 \leftarrow r_0 + 1$;
6)         Until $G_j(x_i) = 1$ or $r_0 = 4m$.
7)     Else
8)         Repeat $EQ_D(H_j) \to (x_i, f(x_i))$; $r_j \leftarrow r_j + 1$;
9)         Until $G_j(x_i) = 1$ or $r_j = 4m$.
10)    If $r_0 = 4m$ Then Return($h_j = $ NULL, $h = h_0$)
11)    If $r_j = 4m$ Then Return($h_j = $ NULL, $h = H_j$)
12)    $S \leftarrow S \cup \{(x_i, f(x_i))\}$.
13) Run $A$ with examples $S \to h_j$
14) Return($h_j$, $h = $ NULL)

Fig. 1. The algorithm Learnh$(j, \varepsilon, \delta)$ learns the $j$th function in the sequence $\mathcal{H}$.
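
The control flow of Figure 1 can be sketched in Python as follows. This is a hedged reading, not the author's code: it assumes $H_j$ predicts the conjunction $h_1 \wedge \cdots \wedge h_{j-1}$ where $h_0 = 0$ and the disjunction where $h_0 = 1$, and that $G_j = 1$ exactly where every earlier $h_i$ disagrees with $h_0$. The oracle and learner are passed in as hypothetical callables, and the early return on a "YES" answer is an addition the figure does not spell out.

```python
import random

def make_H_G(h0, hs):
    """Build H_j and G_j from h_0 and hs = [h_1, ..., h_{j-1}] (assumed reading)."""
    def H(x):
        # conjunction of the h_i where h_0(x) = 0, disjunction where h_0(x) = 1
        return any(h(x) for h in hs) if h0(x) else all(h(x) for h in hs)
    def G(x):
        # true exactly when every earlier h_i disagrees with h_0 on x
        return (not any(h(x) for h in hs)) if h0(x) else all(h(x) for h in hs)
    return H, G

def learnh(h0, hs, eq_oracle, learner, m, rng):
    """One call of Learnh. Returns (h_j, refined); exactly one is None."""
    H, G = make_H_G(h0, hs)
    r0 = rj = 0
    sample = []
    for _ in range(m):
        heads = rng.random() < 0.5            # step 3: fair coin
        while True:                           # steps 5-6 or 8-9
            queried = h0 if heads else H
            answer = eq_oracle(queried)
            if heads:
                r0 += 1
            else:
                rj += 1
            if answer == "YES":               # assumption: queried hypothesis is exact
                return None, queried
            x, label = answer
            if G(x) or (r0 if heads else rj) == 4 * m:
                break
        if r0 == 4 * m:                       # step 10
            return None, h0
        if rj == 4 * m:                       # step 11
            return None, H
        sample.append((x, label))             # step 12
    return learner(sample), None              # steps 13-14
```

Here `m` plays the role of $m_A(\varepsilon/2, \delta, I_f)$ and `learner` the role of the PAC algorithm $A$; both are left to the caller.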



Claim 3.1 For every $j$, with probability at least $1 - \delta$ we have $w_j \le \varepsilon\eta$ and
$u_j \le \varepsilon u_{j-1}$.

Claim 3.2 With probability at least $1 - j\delta$ we have: for all $i \le j$,
$w_i \le \varepsilon\eta$, $u_i \le \varepsilon^i$, $v_i \le i\varepsilon\eta$ and $\Pr_D[f \ne H_{i+1}] \le i\varepsilon\eta + \varepsilon^i$.

Claim 3.3 If $\Pr_D[f \ne h_0] > 2(j-1)\varepsilon\eta$ then the probability that Learnh$(j, \varepsilon, \delta)$
returns $h = h_0$ is less than $j\delta$.

Claim 3.4 If $\Pr_D[f \ne H_j] > 2(j-1)\varepsilon\eta$ then the probability that Learnh$(j, \varepsilon, \delta)$
returns $h = H_j$ is less than $j\delta$.

The first and the second claims give bounds for $w_i$, $u_i$ and $v_i$ and show that
with high probability the error of the hypothesis $H_{j+1}$ is less than $j\varepsilon\eta + \varepsilon^j$. The
other two claims show that if the algorithm stops in steps 10 or 11 then with
high probability the hypothesis $h_0$ or $H_j$, respectively, achieves error at most
$2(j-1)\varepsilon\eta$. In the next subsection we will choose $j$ and $\varepsilon$ such that those errors
are less than $\eta/2$.

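Proof of Property 1. We have $u_0 = \Pr_D[f = h_0] \le 1$, which implies (1). Now

$$v_j = \Pr_D[\bar{h}_0 \overline{H_{j+1}^{\wedge}} = 1, f = 1] + \Pr_D[h_0 H_{j+1}^{\vee} = 1, f = 0]$$

$$= \sum_{i=1}^{j} \left( \Pr_D[\bar{h}_0 H_i^{\wedge} \bar{h}_i = 1, f = 1] + \Pr_D[h_0 \overline{H_i^{\vee}} h_i = 1, f = 0] \right) = \sum_{i=1}^{j} w_i.$$

This implies (2). Finally we have

$$\Pr_D[f \ne H_{j+1}] = \Pr_D[f \ne H_{j+1}, h_0 = 0, f = 0] + \Pr_D[f \ne H_{j+1}, h_0 = 1, f = 1]$$
$$\quad + \Pr_D[f \ne H_{j+1}, h_0 = 0, f = 1] + \Pr_D[f \ne H_{j+1}, h_0 = 1, f = 0]$$
$$= \Pr_D[\bar{h}_0 H_{j+1}^{\wedge} = 1, f = 0] + \Pr_D[h_0 \overline{H_{j+1}^{\vee}} = 1, f = 1]$$
$$\quad + \Pr_D[\bar{h}_0 \overline{H_{j+1}^{\wedge}} = 1, f = 1] + \Pr_D[h_0 H_{j+1}^{\vee} = 1, f = 0]$$
$$= u_j + v_j,$$

and this implies (3). □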

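Proof of Claim 3.1. When Learnh learns $h_j$ it asks with probability 1/2
$EQ_D(h_0)$ and with probability 1/2 $EQ_D(H_j)$, and takes only points $x_i$ that
satisfy $G_j(x_i) = 1$ (see steps 5-6 and 8-9 in Learnh). Let $D_j$ be the probability
distribution of $x_i$. Since the events $f \ne h_0, G_j = 1$ and $f \ne H_j, G_j = 1$ are
disjoint (take two cases $f = 0$ and $f = 1$ and use property P4) and since the
algorithm takes $m_A(\varepsilon/2, \delta, I_f)$ examples to learn $h_j$, with probability at least
$1 - \delta$ we have

$$\frac{\varepsilon}{2} \ge \Pr_{D_j}[f \ne h_j] = \frac{1}{2}\Pr_D[f \ne h_j \mid f \ne h_0, G_j = 1] + \frac{1}{2}\Pr_D[f \ne h_j \mid f \ne H_j, G_j = 1]. \qquad (2)$$

By (1) and (2) we have

$$\varepsilon \ge \Pr_D[f \ne h_j \mid f \ne h_0, G_j = 1] = \frac{\Pr_D[f \ne h_j, f \ne h_0, G_j = 1]}{\Pr_D[f \ne h_0, G_j = 1]} \ge \frac{\Pr_D[\bar{h}_0 H_j^{\wedge} \bar{h}_j = 1, f = 1] + \Pr_D[h_0 \overline{H_j^{\vee}} h_j = 1, f = 0]}{\eta} = \frac{w_j}{\eta}.$$

Therefore $w_j \le \varepsilon\eta$.
By (2) we have

$$\varepsilon \ge \Pr_D[f \ne h_j \mid f \ne H_j, G_j = 1] = \frac{\Pr_D[f \ne h_j, f \ne H_j, G_j = 1]}{\Pr_D[f \ne H_j, G_j = 1]} = \frac{\Pr_D[\bar{h}_0 H_{j+1}^{\wedge} = 1, f = 0] + \Pr_D[h_0 \overline{H_{j+1}^{\vee}} = 1, f = 1]}{\Pr_D[\bar{h}_0 H_j^{\wedge} = 1, f = 0] + \Pr_D[h_0 \overline{H_j^{\vee}} = 1, f = 1]} = \frac{u_j}{u_{j-1}}.$$

Therefore $u_j \le \varepsilon u_{j-1}$. □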

Now the proof of Claim 3.2 follows from Property 1 and Claim 3.1.

Proof of Claim 3.3. We have that $\Pr_D[G_j = 0 \mid f \ne h_0]$ is equal to

$$\frac{\Pr_D[G_j = 0, f \ne h_0]}{\Pr_D[f \ne h_0]} = \frac{\Pr_D[\bar{h}_0 \overline{H_j^{\wedge}} = 1, f = 1] + \Pr_D[h_0 H_j^{\vee} = 1, f = 0]}{\Pr_D[f \ne h_0]} \le \frac{v_{j-1}}{2(j-1)\varepsilon\eta}.$$

By Claim 3.2, with probability at least $1 - (j-1)\delta$, $\Pr_D[G_j = 0 \mid f \ne h_0] \le 1/2$.

Suppose Learnh calls the equivalence query $EQ_D(h_0)$ $4m$ times. Let $X_r$
be a random variable that is equal to 1 if the $r$th call of $EQ_D(h_0)$ returns a
counterexample $x'$ such that $G_j(x') = 0$ and $X_r = 0$ otherwise. Then

$$E[X_r] = \Pr_D[G_j = 0 \mid f \ne h_0] \le \frac{1}{2}.$$

If Learnh$(j, \varepsilon, \delta)$ outputs $h_0$ then since the algorithm makes at most $m$ coin
flips (see steps 2-3 in Learnh) we have

$$\sum_{i=1}^{4m} X_i \ge 3m.$$

Now given that $\{X_r\}_r$ are independent random variables and $E[X_r] \le 1/2$, and
using the Chernoff bound, we have that $\Pr[\text{Learnh}(j, \varepsilon, \delta) \text{ outputs } h_0]$ is

$$\Pr\left[\sum_{i=1}^{4m} X_i \ge 3m\right] \le \Pr\left[\frac{\sum_{i=1}^{4m} X_i}{4m} \ge E[X_r] + \frac{1}{4}\right] \le e^{-m/4} \le \delta.$$

The latter inequality follows because $m \ge 4\ln(1/\delta)$. Therefore, the probability
that Learnh$(j, \varepsilon, \delta)$ outputs $h_0$ is at most $j\delta$. □
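Proof of Claim 3.4. We have that $\Pr_D[G_j = 0 \mid f \ne H_j]$ is

$$\frac{\Pr_D[G_j = 0, f \ne H_j]}{\Pr_D[f \ne H_j]} = \frac{\Pr_D[h_0 H_j^{\vee} = 1, f = 0] + \Pr_D[\bar{h}_0 \overline{H_j^{\wedge}} = 1, f = 1]}{\Pr_D[f \ne H_j]} \le \frac{v_{j-1}}{2(j-1)\varepsilon\eta}.$$

Then the proof is exactly the same as the proof of Claim 3.3. □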



Refine$(h_0, k, \varepsilon, \delta)$

1) $j \leftarrow 0$
2) Repeat
3)     $j \leftarrow j + 1$
4)     Learnh$(j, \varepsilon, \delta/(3k^2)) \to (h_j, h)$
5) Until $j = k$ or $h_j = $ NULL
6) If $j = k$ Then $h \leftarrow H_{k+1}$
7) Return($h$).

Fig. 2. A PAExact-learning algorithm that refines $h_0$.
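
The loop of Figure 2 can be sketched as below. The sketch is illustrative only: `learnh_step` is a hypothetical callable standing in for a call to Learnh with the current sequence, the final hypothesis $H_{k+1}$ is built under the assumed reading (conjunction of $h_1, \ldots, h_k$ where $h_0 = 0$, disjunction where $h_0 = 1$), and a NULL result is checked before the $j = k$ case, which is one way of resolving the order of steps 5 and 6.

```python
def refine(h0, k, learnh_step):
    """Run at most k rounds of Learnh and return the refined hypothesis.

    learnh_step(j, hs) must return (h_j, refined) where exactly one of the
    two is None; hs is the list [h_1, ..., h_{j-1}] learned so far.
    """
    hs = []
    for j in range(1, k + 1):
        hj, refined = learnh_step(j, hs)
        if hj is None:          # Learnh stopped in step 10 or 11
            return refined
        hs.append(hj)
    frozen = list(hs)           # j = k: output H_{k+1}
    return lambda x: (any(h(x) for h in frozen) if h0(x)
                      else all(h(x) for h in frozen))
```

In the paper's setting `learnh_step` would call Learnh with confidence $\delta/(3k^2)$, so that a union bound over the at most $k$ calls keeps the total failure probability below $\delta$.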



We can now build the procedure that refines the function $h_0$. In Figure 2 the
procedure Refine runs Learnh at most $k$ times. It stops running Learnh and
outputs a refined hypothesis if one of the following happens:

1. The function $h_j$ is equal to NULL and then it outputs either $h_0$ or $H_j$
(depending on what $h$ is).

2. We get $j = k$ and then it outputs $H_{k+1}$.

We now prove

Lemma 1. Suppose $\Pr_D[f \ne h_0] \le \eta$ and $h = $ Refine$(h_0, k, \varepsilon, \delta)$. Then with
probability at least $1 - \delta$ we have

$$\Pr_D[f \ne h] \le \max(k\varepsilon\eta + \varepsilon^k, 2k\varepsilon\eta).$$

Proof. Let $\mathcal{H} = h_1, h_2, \ldots, h_t$, $t \le k$, be the sequence of hypotheses generated
by Learnh. Let $\delta' = \delta/(3k^2)$. We want to measure the probability that the
algorithm fails to output a hypothesis $h$ that $\eta'$-approximates $f$ where $\eta' =
\max(k\varepsilon\eta + \varepsilon^k, 2k\varepsilon\eta)$. This happens if and only if one of the following events
happens:

[$A_1$] For some $j = t \le k$, Learnh$(j, \varepsilon, \delta')$ outputs $h = h_0$ and $\Pr_D[f \ne h_0] > \eta'$.
[$A_2$] For some $j = t \le k$, Learnh$(j, \varepsilon, \delta')$ outputs $h = H_j$ and $\Pr_D[f \ne H_j] > \eta'$.
[$A_3$] We have $t = k$ and $\Pr_D[f \ne H_{k+1}] > \eta'$.

Now since for $j = 1, \ldots, k$ we have $2(j-1)\varepsilon\eta < 2k\varepsilon\eta \le \eta'$, by Claim 3.3

$$\Pr[A_1] \le \Pr[\exists\, 1 \le j \le k : \text{Learnh}(j, \varepsilon, \delta') \text{ outputs } h = h_0 \text{ and } \Pr_D[f \ne h_0] > 2(j-1)\varepsilon\eta] \le \sum_{j=1}^{k} j\delta' \le k^2\delta' = \frac{\delta}{3}.$$

In the same way one can prove $\Pr[A_2] \le \delta/3$. Now since $k\varepsilon\eta + \varepsilon^k \le \eta'$, by Claim 3.2

$$\Pr[A_3] \le \Pr\left[\Pr_D[f \ne H_{k+1}] > k\varepsilon\eta + \varepsilon^k\right] \le k\delta' \le \frac{\delta}{3}.$$

Therefore, the probability that the algorithm fails to output a hypothesis that
$\eta'$-approximates $f$ is less than $\delta$. □



3.2 The Algorithm and Its Analysis 

We are now ready to give the PAExact-learning algorithm. We will first give the
algorithm and prove its correctness. Then we give the analysis of the algorithm's
complexity.

Let $A$ be a PAC-learning algorithm that learns $C$ in polynomial time and
$m_A(\varepsilon, \delta, I_f)$ examples. In Figure 3, the algorithm PAExact-Learn$(\eta, \delta)$ defines

$$\varepsilon = \frac{\log\log\frac{1}{\eta}}{16\log\frac{1}{\eta}}, \qquad \delta' = \frac{\delta}{6\log\frac{1}{\eta}} \qquad \text{and} \qquad k = \left\lceil \frac{2\log\frac{1}{\eta}}{\log\log\frac{1}{\eta}} \right\rceil. \qquad (3)$$

The algorithm first runs $A$ to get some hypothesis $h_0$. Then it runs Refine
$\lceil \log(1/\eta) \rceil$ times. From the above analysis the following theorems follow.
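PAExact-Learn$(\eta, \delta)$

1) Set $\varepsilon \leftarrow \dfrac{\log\log\frac{1}{\eta}}{16\log\frac{1}{\eta}}$; $\delta' \leftarrow \dfrac{\delta}{6\log\frac{1}{\eta}}$; $k \leftarrow \left\lceil \dfrac{2\log\frac{1}{\eta}}{\log\log\frac{1}{\eta}} \right\rceil$.
2) Run $A$ with $m_A(\varepsilon, \delta', I_f)$ examples $\to h_0$
3) For $t \leftarrow 1$ to $\lceil \log(1/\eta) \rceil$
4)     $h_0 \leftarrow$ Refine$(h_0, k, \varepsilon, \delta')$
5) Output($h_0$)

Fig. 3. A PAExact-learning algorithm that learns the class $C$ with error $\eta$ and
confidence $\delta$.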
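
As a quick numerical check of the parameter choice in step 1, the following sketch computes $\varepsilon$, $\delta'$, $k$ and the number of Refine rounds. It assumes base-2 logarithms and $\eta < 1/2$ (so that $\log\log(1/\eta) > 0$); both are assumptions of this illustration, and the function name is hypothetical.

```python
import math

def paexact_parameters(eta, delta):
    """Return (eps, delta_prime, k, rounds) for PAExact-Learn(eta, delta).

    Assumes base-2 logarithms and eta < 1/2.
    """
    log_inv = math.log2(1 / eta)        # log(1/eta)
    loglog_inv = math.log2(log_inv)     # log log(1/eta)
    eps = loglog_inv / (16 * log_inv)
    delta_prime = delta / (6 * log_inv)
    k = math.ceil(2 * log_inv / loglog_inv)
    rounds = math.ceil(log_inv)         # number of calls to Refine
    return eps, delta_prime, k, rounds

eps, delta_prime, k, rounds = paexact_parameters(2 ** -16, 0.06)
```

For $\eta = 2^{-16}$ this gives $\varepsilon = 1/64$ and $k = 8$, so $2k\varepsilon\eta = \eta/4$ and $\varepsilon^k = 2^{-48}$; both terms of the bound in Lemma 1 are then well below $\eta/2$ when the input error is at most $\eta$.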



Theorem 1. (Correctness) Algorithm PAExact-Learn$(\eta, \delta)$ learns with
probability at least $1 - \delta$ a hypothesis that $\eta$-approximates $f$.

Proof of Theorem 1. Let $h_0^{(0)}, h_0^{(1)}, \ldots, h_0^{(t)}$, $t = \lceil \log(1/\eta) \rceil$, be the functions
learned in line 4 of the algorithm. Here $h_0^{(0)} = h_0$ is the hypothesis that is
learned in line 2 of the algorithm. We have with probability at least $1 - \delta'$,
$\Pr_D[f \ne h_0^{(0)}] \le \varepsilon$, and by Lemma 1 with probability at least $1 - \delta'$ we have

$$\Pr_D[f \ne h_0^{(i)}] \le \max(k\varepsilon\eta_0 + \varepsilon^k, 2k\varepsilon\eta_0)$$

where $\Pr_D[f \ne h_0^{(i-1)}] = \eta_0$. Now since $k\varepsilon\eta_0 + \varepsilon^k \le \eta_0/4 + \eta/4$ and $2k\varepsilon\eta_0 \le \eta_0/2$
and since $\max(\eta_0/4 + \eta/4, \eta_0/2) \le \max(\eta_0/2, \eta)$, we have

$$\Pr_D[f \ne h_0^{(i)}] \le \max\left(\frac{\Pr_D[f \ne h_0^{(i-1)}]}{2}, \eta\right).$$

Therefore, with probability at least $1 - \delta$ we have $\Pr_D[f \ne h_0^{(t)}] \le \eta$. This
completes the proof of the Theorem. □

For the analysis of the algorithm we first give a very general Theorem and 
then apply it to different settings. 

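Theorem 2. (Efficiency) Algorithm PAExact-Learn$(\eta, \delta)$ uses

$$\frac{16\log^2\frac{1}{\eta}}{\log\log\frac{1}{\eta}} \cdot m_A\left(\frac{\log\log\frac{1}{\eta}}{32\log\frac{1}{\eta}},\ \frac{\delta\left(\log\log\frac{1}{\eta}\right)^2}{72\log^3\frac{1}{\eta}},\ I_f\right)$$

equivalence queries.

Proof of Theorem 2. We will use the notations in (3). Algorithm
PAExact-Learn$(\eta, \delta)$ calls the procedure Refine$(h_0, k, \varepsilon, \delta')$ $\lceil \log(1/\eta) \rceil$ times. The
procedure Refine$(h_0, k, \varepsilon, \delta')$ calls the procedure Learnh$(j, \varepsilon, \delta'/(3k^2))$ $k$
times and the procedure Learnh$(j, \varepsilon, \delta'/(3k^2))$ calls the example oracle at most
$8 m_A(\varepsilon/2, \delta'/(3k^2), I_f)$ times. The result follows. □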

It follows from Theorem 2 
Theorem 3. If $C$ is PAC-learnable with error $\frac{1}{2} - \gamma$ and confidence $\lambda$ with
sample of size $V_0$, then $C$ is PAExact-learnable with

d{5, ri) = 0 ^wlog i ^ log^ , 

equivalence queries where 




and time polynomial in $d$ and $1/\lambda$.

Proof. Follows from Theorem 2 and Corollary 3.3 in [F95]. □

Before we leave this section we give the algorithm that will be used in the next
section. We will use $d(\delta, \eta)$ to denote the complexity of the algorithm
PAExact-Learn$(\eta, \delta)$. Let $\eta(d)$ be a function of $d$ such that $d = 2d(\delta, \eta(d))$. We now prove

Theorem 4. The algorithm PAExact-Learn$(\delta)$, after the $d$th mistake, will be
holding a hypothesis that with probability at least $1 - \delta$ has error $\eta(d)$.

Proof. Set a constant $d_0$. We run PAExact-Learn$(\delta, \eta(d_0))$ twice and after $2d_0$
mistakes we run PAExact-Learn$(\delta, \eta(2d_0))$ and so on. When $d = d_0 + d_0 + 2d_0 +
\cdots + 2^i d_0 = 2^{i+1} d_0$, with probability at least $1 - \delta$ the final hypothesis has error
$\eta$ where $2^i d_0 = d(\delta, \eta)$. This gives $d = 2d(\delta, \eta(d))$. □

4 A Prediction Strategy and Its Analysis 

In this section we use the algorithm PAExact-Learn$(\delta)$ to give a deterministic
prediction strategy. Then we give an analysis of its mistake probability.

First we may assume that $t$ is known. This is because we may run our
prediction strategy assuming $t = t_0$ and get a prediction strategy with mistake
probability $\varepsilon(t_0)$. If $t > t_0$ then at trials $t_0 + 1, t_0 + 2, \ldots, 3t_0$ we use the
prediction strategy used in trial $t_0$ and at the same time learn a new prediction
strategy (from the last $2t_0$ examples) that has mistake probability $\varepsilon(2t_0)$. It is
easy to see that this doubling technique will solve the problem when $t$ is not
known.

Second we may assume that $t$ is large enough. As long as $t$ is polynomial
in the other parameters, we can use the PAC-learning algorithm to learn
a hypothesis and use this hypothesis for the prediction. This hypothesis will
achieve error $\log t/t$.

We also need a bound on the VC-dimension of the class of all possible
output hypotheses of PAExact-Learn$(\delta)$ at trial $t$. Obviously this cannot
be more than the number of examples we use in the PAExact algorithm, which is
$poly(\log t, \log(1/\delta), I_f)$. We denote this by $V_H$.

The prediction strategy is described in Figure 4. The procedure
saves the hypotheses $h_0, h_1, h_2, \ldots$ generated from PAExact-Learn$(\delta)$ and for
Predict$(x_t, S = ((h_0, t_0), \ldots, (h_d, t_d)))$

1) Initial $h_0 = 1$; $t_0 = 0$; $S = ((h_0, t_0))$.
2) Let $\ell = \arg\max_i t_i$.
3) Find $\eta_0$ and $\eta_1$ in
       $d = 2d(\eta_0, \eta_0)$ and $\dfrac{t}{d} = \dfrac{V_H}{\eta_1}\log\dfrac{1}{\eta_1} + \dfrac{1}{\eta_1}\log\dfrac{1}{\eta_1}$
4) If $\eta_1 < \eta_0$ Predict $h_\ell(x_t)$ Else Predict $h_d(x_t)$.
5) Receive $f(x_t)$
6) If $h_d(x_t) \ne f(x_t)$ Then
7)     Answer $x_t$ to $EQ_D(h_d)$ that is asked in PAExact-Learn$(\eta_0)$ and
8)     Receive a new hypothesis $h_{d+1}$ when the next $EQ_D(h_{d+1})$ is asked
9)     Add $S \leftarrow (S, (h_{d+1}, 0))$.
10) Else
11)     $t_d = t_d + 1$.
12) Goto 2.

Fig. 4. A deterministic prediction strategy.



each hypothesis $h_i$ it saves $t_i$, the number of examples $x_j$ on which $h_i$ predicted
correctly. Notice that the algorithm in line 4 does not necessarily choose the
last hypothesis $h_d$ for the prediction. In some cases (depending on $\eta_0$ and $\eta_1$) it
chooses the hypothesis that is consistent with the longest sequence of consecutive
examples (see lines 2-4 in the algorithm).

The idea of the proof is very simple. If the number of mistakes $d$ is “large”
then the probability of the mistake of the final hypothesis is small. Otherwise
(if $d$ is small) there is a hypothesis that is consistent with $t/d$ consecutive
examples and then this hypothesis will have a small prediction error.

We prove the following Theorem 



Theorem 5. The probability of the prediction error of the strategy Predict is
smaller than

$$\frac{poly(\log t)}{t}$$

and the running time in each trial is $poly(\log t)$.

Proof Sketch. Notice that the number of mistakes $d = poly(\log t)$ and the
size of each hypothesis $h_i$ is at most $poly(\log t)$. Therefore, the running time is
$poly(\log t)$ at each trial.

If $\eta_0 < \eta_1$ then $d = 2d(\eta_0, \eta_0)$ and by Theorem 4 the hypothesis $h_d$ with
probability $1 - \eta_0$ has error $\eta_0$. Therefore $h_d$ will predict $x_{t+1}$ with mistake
probability at most $2\eta_0$.

If $\eta_1 \le \eta_0$ then since $h_\ell$ is consistent on at least

$$\frac{t}{d} = \frac{V_H}{\eta_1}\log\frac{1}{\eta_1} + \frac{1}{\eta_1}\log\frac{1}{\eta_1}$$

consecutive examples, and since there is at most subsequences of consecutive
examples, then by OCCAM, with probability at least $1 - \eta_1$ the hypothesis $h_\ell$
has error $\eta_1$. Therefore $h_\ell$ predicts $f(x_{t+1})$ with mistake probability at most $2\eta_1$.
This implies that the probability of the prediction mistake at trial $t$ is at most
$\eta = 2\min(\eta_0, \eta_1)$.

Since $t$ is fixed we can consider $\eta_0$ and $\eta_1$ as functions of $d$. The error $\eta_0$ is
monotonically decreasing as a function of $d$ and $\eta_1$ is monotonically increasing
as a function of $d$. Therefore, $\eta \le 2\eta'$ where $\eta' = \eta_0 = \eta_1$. Replacing $\eta_0$ and $\eta_1$
by $\eta'$ we get



d = O w log - + — log , 

V V 7 7 J 



LU = 



321og"A 



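and

$$\frac{t}{d} = \frac{V_H}{\eta'}\log\frac{1}{\eta'} + \frac{1}{\eta'}\log\frac{1}{\eta'}.$$

Then

$$t = d\left(\frac{V_H}{\eta'}\log\frac{1}{\eta'} + \frac{1}{\eta'}\log\frac{1}{\eta'}\right) = \frac{poly\left(\log\frac{1}{\eta'}\right) poly(\log t)}{\eta'},$$

which implies $\eta \le 2\eta' = poly(\log t)/t$. □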

References 

[A88] D. Angluin. Queries and concept learning. Machine Learning, 2:319-342, 1988.

[B94] A. Blum. Separating distribution-free and mistake-bound learning models
over the boolean domain. SIAM Journal on Computing, 23(5), pp. 990-1000, 1994.

[B97] N. H. Bshouty. Exact learning of formulas in parallel. Machine Learning,
26, pp. 25-41, 1997.

[BC+96] N. H. Bshouty, R. Cleve, R. Gavalda, S. Kannan, C. Tamon. Oracles and
Queries That Are Sufficient for Exact Learning. Journal of Computer and
System Sciences, 52(3), pp. 421-433, 1996.

[BG02] N. H. Bshouty and D. Gavinsky. PAC=PAExact and other equivalent
models in learning. Proceedings of the 43rd Ann. Symp. on Foundation of
Computer Science (FOCS), pp. 167-176, 2002.

[BJT02] N. H. Bshouty, J. Jackson and C. Tamon. Exploring learnability between
exact and PAC. Proceedings of the 15th Annual Conference on Computational
Learning Theory, pp. 244-254, 2002.

[F95] Y. Freund. Boosting a weak learning algorithm by majority. Information
and Computation, 121, pp. 256-285, 1995.

[HLW94] D. Haussler, N. Littlestone and M. K. Warmuth. Predicting {0,1}-functions
on randomly drawn points. Information and Computation, 115, pp. 248-292, 1994.

[KM96] M. Kearns and Y. Mansour. On the Boosting Ability of Top-Down Decision
Tree Learning Algorithms. Proceedings of the 28th Symposium on
Theory of Computing, pp. 459-468, 1996.

[L88] N. Littlestone. Learning when irrelevant attributes abound: A new
linear-threshold algorithm. Machine Learning, 2:285-318, 1988.

[MA00] Y. Mansour and D. McAllester. Boosting using Branching Programs.
Proceedings of the 13th Annual Conference on Computational Learning Theory,
pp. 220-224, 2000.

[O03] D. Gavinsky and A. Owshanko. PExact=Exact learning. Manuscript.

[S90] R. E. Schapire. The strength of weak learnability. Machine Learning,
5(2), pp. 197-227, 1990.

[V84] L. Valiant. A theory of the learnable. Communications of the ACM,
27(11):1134-1142, November 1984.




Minimizing Regret with 
Label Efficient Prediction* 



Nicolo Cesa-Bianchi^1, Gabor Lugosi^2, and Gilles Stoltz^3

^1 DSI, Universita di Milano
via Comelico 39, 20135 Milano, Italy
cesa-bianchi@dsi.unimi.it
^2 Department of Economics, Universitat Pompeu Fabra
Ramon Trias Fargas 25-27, 08005 Barcelona, Spain
lugosi@upf.es
^3 Laboratoire de Mathematiques, Universite Paris-Sud,
91405 Orsay Cedex, France
gilles.stoltz@math.u-psud.fr



Abstract. We investigate label efficient prediction, a variant of the
problem of prediction with expert advice, proposed by Helmbold and
Panizza, in which the forecaster does not have access to the outcomes of
the sequence to be predicted unless he asks for it, which he can do for a
limited number of times. We determine matching upper and lower bounds
for the best possible excess error when the number of allowed queries is
a constant. We also prove that a query rate of order $(\ln n)(\ln\ln n)^2/n$
is sufficient for achieving Hannan consistency, a fundamental property
in game-theoretic prediction models. Finally, we apply the label efficient
framework to pattern classification and prove a label efficient mistake
bound for a randomized variant of Littlestone's zero-threshold Winnow
algorithm.



1 Introduction 

Prediction with expert advice, a framework introduced about fifteen years ago
in learning theory, may be viewed as a direct generalization of the theory of
repeated games, a field pioneered by Hannan in the mid-fifties. At a certain level
of abstraction, the common subject of these studies is the problem of forecasting
each element $y_t$ of an unknown “target” sequence given the knowledge of the
previous elements $y_1, \ldots, y_{t-1}$. The forecaster's goal is to predict the target
sequence almost as well as any forecaster using the same guess all the time. We
call this the sequential prediction problem. To provide a suitable parametrization
of the problem, we assume that the set from which the forecaster picks its guesses
is finite of size $N > 1$, while the set to which the target sequence elements belong
may be of arbitrary cardinality. A real-valued bounded loss function $\ell$ is then
used to quantify the discrepancy between each outcome $y_t$ and the forecaster's

* The authors gratefully acknowledge partial support by the PASCAL Network of 
Excellence under EC grant no. 506778. 



J. Shawe-Taylor and Y. Singer (Eds.): COLT 2004, LNAI 3120, pp. 77—92, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 
LABEL EFFICIENT PREDICTION

Parameters: number $N$ of actions, outcome space $\mathcal{Y}$, loss function $\ell$, time horizon
$n$, budget $m$ of queries.

For each round $t = 1, \ldots, n$

(1) the environment chooses the next outcome $y_t \in \mathcal{Y}$ without revealing it;

(2) the forecaster chooses an action $I_t \in \{1, \ldots, N\}$;

(3) each action $i$ incurs loss $\ell(i, y_t)$;

(4) if less than $m$ queries have been issued so far the forecaster may issue a new
query to obtain $y_t$; if no query is issued then $y_t$ remains unknown.

Fig. 1. Label efficient prediction as a game between the forecaster and the
environment.



guess for $y_t$. Hannan's seminal result [7] showed that randomized forecasters
exist whose excess cumulative loss (or regret), with respect to the loss of any
constant forecaster, grows sublinearly in the length $n$ of the target sequence,
and this holds for any individual target sequence. In particular, Hannan found
the optimal growth rate, $\Theta(\sqrt{n})$, of the regret as a function of the sequence
length $n$ when no assumption other than boundedness is made on the loss
$\ell$. Only relatively recently, Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire,
and Warmuth [4] have revealed that the correct dependence on $N$ in the minimax
regret rate is $\Theta(\sqrt{n \ln N})$.

Game theorists and learning theorists, who independently studied the
sequential prediction model, addressed the fundamental question of whether a
sub-linear regret rate is achievable in case the past outcomes $y_1, \ldots, y_{t-1}$ are
not entirely accessible when computing the guess for $y_t$. In this work we
investigate a variant of sequential prediction known as label efficient prediction. In this
model, originally proposed by Helmbold and Panizza [8], after choosing its guess
at time $t$ the forecaster decides whether to query the outcome $y_t$. However, the
forecaster is limited in the number of queries he can issue within a given time
horizon. We prove that a query rate of order $(\ln n)(\ln\ln n)^2/n$ is sufficient for
achieving Hannan consistency (i.e., regret growing sub-linearly with probability
one). Moreover, we show that any forecaster issuing at most $m$ queries must
suffer a regret of at least order $n\sqrt{(\ln N)/m}$ on some outcome sequence of length
$n$, and we show a randomized forecaster achieving this regret to within constant
factors. We conclude the paper by proving a label efficient mistake bound for a
randomized variant of Littlestone's zero-threshold Winnow, an algorithm based
on exponential weights for binary pattern classification.
2 Sequential Prediction and the Label Efficient Model

The sequential prediction problem is parametrized by a number $N > 1$ of player
actions, by a set $\mathcal{Y}$ of outcomes, and by a loss function $\ell$. The loss function has
domain $\{1, \ldots, N\} \times \mathcal{Y}$ and takes values in a bounded real interval, say $[0, 1]$. Given
an unknown mechanism adaptively generating a sequence $y_1, y_2, \ldots$ of elements
from $\mathcal{Y}$, a prediction strategy, or forecaster, chooses an action $I_t \in \{1, \ldots, N\}$
incurring a loss $\ell(I_t, y_t)$. A crucial assumption in this model is that the
forecaster can choose $I_t$ only based on information related to the past outcomes
$y_1, \ldots, y_{t-1}$. That is, the forecaster's decision must not depend on any of the
future outcomes. In the label efficient model, after choosing $I_t$ the forecaster
decides whether to issue a query to access $y_t$. If no query is issued, then $y_t$
remains unknown. In other words, $I_t$ does not depend on all the past outcomes
$y_1, \ldots, y_{t-1}$, but only on the queried ones. The label efficient model is best
described as a repeated game between the forecaster, choosing actions, and the
environment, choosing outcomes (see Figure 1).



3 Regret and Hannan Consistency 

The cumulative loss of the forecaster on a sequence $y_1, y_2, \dots$ of outcomes is
denoted by

$$\hat L_n = \sum_{t=1}^{n} \ell(I_t, y_t) \quad \text{for } n \ge 1.$$

As our forecasting strategies are randomized, each $I_t$ is viewed as a random
variable whose distribution over $\{1,\dots,N\}$ must be fully determined at time $t$.
Without further specifications, all probabilities and expectations will be understood
with respect to the $\sigma$-algebra of events generated by the sequence
$I_1, I_2, \dots$ of the forecaster's random choices. We compare the forecaster's loss
with the cumulative losses of the $N$ constant forecasters, $L_{i,n} = \sum_{t=1}^{n} \ell(i, y_t)$,
$i = 1,\dots,N$.

In particular, we devise label efficient forecasting strategies whose expected
regret $\mathbb{E}\,\hat L_n - \min_{i=1,\dots,N} L_{i,n}$ grows sublinearly in $n$ for any individual sequence
$y_1, y_2, \dots$ of outcomes. Via a more refined analysis, we also prove the stronger
result

$$\hat L_n - \min_{i=1,\dots,N} L_{i,n} = o(n) \quad \text{a.s.},$$

for any sequence $y_1, y_2, \dots$ of outcomes, almost surely with respect to the auxiliary
randomization the forecaster has access to. This property, known as Hannan
consistency in game theory, rules out the possibility that the regret is much larger
than its expected value with a significant probability.
Parameters: Real numbers $\eta > 0$ and $0 < \varepsilon < 1$.
Initialization: $w_1 = (1, \dots, 1)$.

For each round $t = 1, 2, \dots$

(1) draw an action from $\{1,\dots,N\}$ according to the distribution

    $p_{i,t} = \dfrac{w_{i,t}}{\sum_{j=1}^{N} w_{j,t}}$,   $i = 1,\dots,N$;

(2) draw a Bernoulli random variable $Z_t$ such that $\mathbb{P}[Z_t = 1] = \varepsilon$;

(3) if $Z_t = 1$ then obtain $y_t$ and compute

    $w_{i,t+1} = w_{i,t}\, e^{-\eta\,\ell(i,y_t)/\varepsilon}$   for each $i = 1,\dots,N$;

    else, let $w_{t+1} = w_t$.

Fig. 2. The label efficient exponentially weighted average forecaster.
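The steps of Figure 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function name, the `loss` callable, and the rescaling of the weights (added purely to avoid floating-point underflow, which leaves the sampling distribution unchanged) are assumptions of this example.

```python
import math
import random

def label_efficient_forecaster(loss, outcomes, n_actions, eta, eps, rng):
    """Run the forecaster of Figure 2; return (actions, number of queries).

    loss(i, y) is assumed to return a value in [0, 1].
    """
    weights = [1.0] * n_actions
    actions, queries = [], 0
    for y in outcomes:
        # (1) draw an action with probability proportional to its weight
        action = rng.choices(range(n_actions), weights=weights)[0]
        actions.append(action)
        # (2) biased coin: query the label with probability eps
        if rng.random() < eps:
            queries += 1
            # (3) exponential update with the estimated loss loss(i, y) / eps
            weights = [w * math.exp(-eta * loss(i, y) / eps)
                       for i, w in enumerate(weights)]
            top = max(weights)  # rescaling only, distribution is unchanged
            weights = [w / top for w in weights]
        # otherwise the weights are left untouched
    return actions, queries
```

With `eps = 1.0` every label is queried and the sketch reduces to the usual exponentially weighted average forecaster.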



4 A Label Efficient Forecaster 



We start by introducing a simple forecaster whose expected regret is bounded by
$n\sqrt{2(\ln N)/m}$, where $m$ is the bound on the number of queries. Thus, if $m = n$
we recover the order of the optimal experts bound. It is easy to see that in
order to achieve a nontrivial performance, a forecaster must use randomization
in determining whether a label should be revealed or not. It turns out that a
simple biased coin does the job. The strategy we propose, sketched in Figure 2,
uses an i.i.d. sequence $Z_1, Z_2, \dots, Z_n$ of Bernoulli random variables such that
$\mathbb{P}[Z_i = 1] = 1 - \mathbb{P}[Z_i = 0] = \varepsilon$ and asks the label $y_t$ to be revealed whenever
$Z_t = 1$. Here $\varepsilon > 0$ is a parameter of the strategy. (Typically, we take $\varepsilon \approx m/n$
so that the number of solicited labels during $n$ rounds is about $m$. Note that
this way the forecaster may ask the value of more than $m$ labels but we ignore
this detail as it can be dealt with by a simple adjustment.) Our label efficient
forecaster uses the estimated losses

$$\tilde\ell(i,y_t) = \begin{cases} \ell(i,y_t)/\varepsilon & \text{if } Z_t = 1, \\ 0 & \text{otherwise.} \end{cases}$$

Note that $\mathbb{E}[\tilde\ell(i,y_t) \mid Z_1^{t-1}, I_1^{t-1}] = \ell(i,y_t)$, where $Z_1^{t-1} = (Z_1,\dots,Z_{t-1})$ and
$I_1^{t-1} = (I_1,\dots,I_{t-1})$. (The conditioning on $Z_1^{t-1}$ and $I_1^{t-1}$ is merely needed to fix
the value of $y_t$, which may depend on the forecaster's past actions.) Therefore,
$\tilde\ell(i,y_t)$ may be considered as an unbiased estimate of the true loss $\ell(i,y_t)$. The
label efficient forecaster then uses the estimated losses to form an exponentially
weighted average forecaster. The expected performance of this strategy may be
bounded as follows.
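For reference, the unbiasedness claim and the second-moment bound used later in the analysis follow from a one-line computation. The display below is a worked restatement under the definitions above (losses in $[0,1]$, query probability $\varepsilon$), not an additional result.

```latex
\mathbb{E}\bigl[\tilde\ell(i,y_t) \mid Z_1^{t-1}, I_1^{t-1}\bigr]
  = \varepsilon \cdot \frac{\ell(i,y_t)}{\varepsilon} + (1-\varepsilon)\cdot 0
  = \ell(i,y_t),
\qquad
\mathbb{E}\bigl[\tilde\ell(i,y_t)^2 \mid Z_1^{t-1}, I_1^{t-1}\bigr]
  = \varepsilon \cdot \frac{\ell(i,y_t)^2}{\varepsilon^2}
  = \frac{\ell(i,y_t)^2}{\varepsilon}
  \le \frac{1}{\varepsilon}.
```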
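Theorem 1. Consider the label efficient forecaster of Figure 2 run with $\varepsilon = m/n$
and $\eta = \sqrt{2m\ln N}/n$. Then, the expected number of revealed labels equals
$m$ and

$$\mathbb{E}\,\hat L_n - \min_{i=1,\dots,N} L_{i,n} \le n\sqrt{\frac{2\ln N}{m}} .$$

In the sequel we write $p_t$ for the $N$-vector of components $p_{i,t}$. We also use the
notation

$$\ell(p_t,y_t) = \sum_{i=1}^{N} p_{i,t}\,\ell(i,y_t) \quad\text{and}\quad \tilde\ell(p_t,y_t) = \sum_{i=1}^{N} p_{i,t}\,\tilde\ell(i,y_t) .$$

Finally, we denote for $i = 1,\dots,N$,

$$\tilde L_{i,n} = \sum_{t=1}^{n} \tilde\ell(i,y_t) .$$

Proof. It is enough to adapt the proof of [1, Theorem 3.1], in the following way.
First, we note that we have an upper bound over the regret in terms of squares
of the losses, see also [12, Theorem 1],

$$\sum_{t=1}^{n} \tilde\ell(p_t,y_t) - \tilde L_{i,n} \le \frac{\ln N}{\eta} + \frac{\eta}{2}\sum_{t=1}^{n}\sum_{j=1}^{N} p_{j,t}\,\tilde\ell(j,y_t)^2, \qquad i = 1,\dots,N.$$

Since $\tilde\ell(j,y_t) \in [0, 1/\varepsilon]$ for all $j$ and $y_t$, we finally get

$$\sum_{t=1}^{n} \tilde\ell(p_t,y_t)\Bigl(1 - \frac{\eta}{2\varepsilon}\Bigr) \le \tilde L_{i,n} + \frac{\ln N}{\eta}, \qquad i = 1,\dots,N. \qquad (1)$$

Taking expectations on both sides and substituting the values of $\eta$ and $\varepsilon$ yields
the desired result.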
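The final substitution can be spelled out as follows. This is a worked sketch for a fixed outcome sequence, using only that the estimated losses are unbiased, that $\hat L_n \le n$, and the parameter values $\varepsilon = m/n$ and $\eta = \sqrt{2m\ln N}/n$ of Theorem 1.

```latex
\mathbb{E}\,\hat L_n - L_{i,n}
  \le \frac{\eta}{2\varepsilon}\,\mathbb{E}\,\hat L_n + \frac{\ln N}{\eta}
  \le \frac{n\eta}{2\varepsilon} + \frac{\ln N}{\eta}
  = n\sqrt{\frac{\ln N}{2m}} + n\sqrt{\frac{\ln N}{2m}}
  = n\sqrt{\frac{2\ln N}{m}} .
```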




Theorem 1 guarantees that the expected per-round regret converges to zero
whenever $m \to \infty$ as $n \to \infty$. The next result shows that in fact this regret
is, with overwhelming probability, bounded by a quantity proportional to
$n\sqrt{(\ln N)/m}$.

Theorem 2. Let $\delta \in (0, 1)$ and consider the label efficient forecaster of Figure 2
run with parameters

$$\varepsilon = \max\Bigl\{0,\ \frac{m - \sqrt{2m\ln(4/\delta)}}{n}\Bigr\} \quad\text{and}\quad \eta = \sqrt{\frac{2\varepsilon\ln N}{n}} .$$

Then, with probability at least $1 - \delta$, the number of revealed labels is at most $m$
and 




min Li „ < 2n 
In the full paper, we will prove a more refined bound in which the factors
$n\sqrt{(\ln N)/m}$ are replaced by $(1 + o(1))\sqrt{n L^* (\ln N)/m}$ in all cases where $L^*$,
the cumulative loss of the best action, is $\Omega((n/m)\ln N)$. In the cases when
$L^*$ is small, the quantity replacing the above terms is of the order of
$(n/m)\ln N$. In particular, we recover the behavior already observed by Helmbold
and Panizza [8] in the case $L^* = 0$ (the best expert makes no mistakes).

Even though the label efficient forecaster investigated above assumes the
preliminary knowledge of the time horizon $n$ (just note that both $\eta$ and $\varepsilon$ depend
on the value of the parameters $n$ and $m$), using standard adaptive techniques,
such as those described in [2], a label efficient forecaster may be constructed
without knowing $n$ in advance. By letting the query budget $m$ depend on $n$, one
can then achieve Hannan consistency, as stated in the next result.

Corollary 1. There exists a randomized label efficient forecaster that achieves
Hannan consistency while issuing, for all $n > 1$, at most $O((\ln\ln n)^2 \ln n)$ queries
in the first $n$ prediction steps.

Proof. An algorithm that achieves Hannan consistency divides time into consecutive
blocks of exponentially increasing length 1, 2, 4, 8, 16, .... In the $r$-th block
(of length $2^{r-1}$) it uses the forecaster of Theorem 2 with parameters $n = 2^{r-1}$,
$m = (\ln r)(\ln\ln r)$ and $\delta = 1/r^2$. Then, using the bound of Theorem 2 it is
easy to see that, with probability one, for all $n$, the algorithm does not ask for
more than $O((\ln\ln n)^2 \ln n)$ labels and the cumulative regret is $o(n)$. Details are
omitted. Just note that it is sufficient to prove the statement for $n = 2^{r-1}$ for
$r \ge 1$.
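The doubling schedule used in this proof is easy to make concrete. The sketch below only illustrates how rounds map to blocks; the function name is hypothetical, and the per-block parameters of Theorem 2 are deliberately left out.

```python
def block_of_round(t):
    """Return (r, first_round, length) of the block containing round t >= 1.

    Block r has length 2**(r-1) and covers rounds 2**(r-1), ..., 2**r - 1.
    """
    r = t.bit_length()          # equals floor(log2(t)) + 1 for t >= 1
    first_round = 2 ** (r - 1)
    length = 2 ** (r - 1)
    return r, first_round, length
```

The forecaster is restarted at the first round of every block, so after $n$ rounds only about $\log_2 n$ blocks have been opened.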

Before proving Theorem 2, note that if $\delta \le 4N e^{-m/8}$ then the right-hand
side of the inequality is greater than $n$ and therefore the statement is trivial.
Thus, we may assume throughout the proof that $\delta > 4N e^{-m/8}$. This also ensures
that $\varepsilon > 0$. We need a number of preliminary lemmas. The first is obtained by
a simple application of Bernstein's inequality.

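Lemma 1. The probability that the strategy asks for more than $m$ labels is at
most $\delta/4$.

Lemma 2. With probability at least $1 - \delta/4$,

$$\sum_{t=1}^{n} \ell(p_t,y_t) \le \sum_{t=1}^{n} \tilde\ell(p_t,y_t) + 2n\sqrt{\frac{2}{m}\ln\frac{4}{\delta}} .$$

Furthermore, with probability at least $1 - \delta/4$, for all $i = 1,\dots,N$,

$$\tilde L_{i,n} \le L_{i,n} + 2\sqrt{2}\,n\sqrt{\frac{\ln(4N/\delta)}{m}} .$$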



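Proof. The proofs of both inequalities rely on Chernoff's bounding. We therefore
only prove the first one. Let $s \le 1$ be a positive number. Define $u = 2\sqrt{(n/\varepsilon)\ln(4/\delta)}$
and observe that since $n/m \ge 1/(2\varepsilon)$ (which is implied by the above assumption
on $\delta$),

$$\mathbb{P}\Bigl[\sum_{t=1}^{n} \ell(p_t,y_t) > \sum_{t=1}^{n} \tilde\ell(p_t,y_t) + 2n\sqrt{\frac{2}{m}\ln\frac{4}{\delta}}\Bigr]
\le \mathbb{P}\Bigl[\sum_{t=1}^{n} \ell(p_t,y_t) > \sum_{t=1}^{n} \tilde\ell(p_t,y_t) + u\Bigr]$$

$$\le \mathbb{E}\Bigl[\exp\Bigl(s\sum_{t=1}^{n}\bigl(\ell(p_t,y_t) - \tilde\ell(p_t,y_t)\bigr)\Bigr)\Bigr] e^{-su} \qquad \text{(by Markov's inequality)}$$

$$= e^{-su}\,\mathbb{E}\Bigl[\exp\Bigl(s\sum_{t=1}^{n-1}\bigl(\ell(p_t,y_t) - \tilde\ell(p_t,y_t)\bigr)\Bigr) \times \mathbb{E}\bigl[\exp\bigl(s(\ell(p_n,y_n) - \tilde\ell(p_n,y_n))\bigr) \mid Z_1^{n-1}, I_1^{n-1}\bigr]\Bigr] .$$

To bound the right-hand side, note that $\ell(p_n,y_n) - \tilde\ell(p_n,y_n) \le 1$ and therefore,
since we assumed $s \le 1$,

$$\mathbb{E}\bigl[\exp\bigl(s(\ell(p_n,y_n) - \tilde\ell(p_n,y_n))\bigr) \mid Z_1^{n-1}, I_1^{n-1}\bigr]$$

$$\le \mathbb{E}\bigl[1 + s(\ell(p_n,y_n) - \tilde\ell(p_n,y_n)) + s^2(\ell(p_n,y_n) - \tilde\ell(p_n,y_n))^2 \mid Z_1^{n-1}, I_1^{n-1}\bigr]
\qquad (\text{since } e^x \le 1 + x + x^2 \text{ for all } x \le 1)$$

$$= 1 + \mathbb{E}\bigl[s^2(\ell(p_n,y_n) - \tilde\ell(p_n,y_n))^2 \mid Z_1^{n-1}, I_1^{n-1}\bigr]
\qquad (\text{since } \mathbb{E}[\ell(p_n,y_n) - \tilde\ell(p_n,y_n) \mid Z_1^{n-1}, I_1^{n-1}] = 0)$$

$$\le 1 + \frac{s^2}{\varepsilon},$$

where the last step holds because

$$\mathbb{E}\bigl[(\ell(p_n,y_n) - \tilde\ell(p_n,y_n))^2 \mid Z_1^{n-1}, I_1^{n-1}\bigr] \le \mathbb{E}\bigl[\tilde\ell(p_n,y_n)^2 \mid Z_1^{n-1}, I_1^{n-1}\bigr] \le 1/\varepsilon .$$

Therefore, using $1 + s^2/\varepsilon \le e^{s^2/\varepsilon}$, we have

$$\mathbb{P}\Bigl[\sum_{t=1}^{n} \ell(p_t,y_t) > \sum_{t=1}^{n} \tilde\ell(p_t,y_t) + u\Bigr]
\le \mathbb{E}\Bigl[\exp\Bigl(s\sum_{t=1}^{n-1}\bigl(\ell(p_t,y_t) - \tilde\ell(p_t,y_t)\bigr)\Bigr)\Bigr] e^{s^2/\varepsilon}\, e^{-su}
\le e^{n s^2/\varepsilon - su}$$

by repeating the previous argument $n - 1$ times. The value of $s$ minimizing the
obtained upper bound is $s = u\varepsilon/(2n)$, which satisfies the condition $s \le 1$ because
$n \ge m \ge u\varepsilon/2$ due to our assumption on $\delta$. Resubstituting this choice for $s$ we
get

$$\mathbb{P}\Bigl[\sum_{t=1}^{n} \ell(p_t,y_t) > \sum_{t=1}^{n} \tilde\ell(p_t,y_t) + u\Bigr] \le e^{-u^2\varepsilon/(4n)} = \frac{\delta}{4},$$

and the proof is completed.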
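The optimization over $s$ in this Chernoff argument is a routine minimization of a quadratic exponent. The short derivation below spells it out; the function name $g$ is introduced here only for illustration.

```latex
g(s) = \frac{n s^2}{\varepsilon} - s u,
\qquad
g'(s) = \frac{2 n s}{\varepsilon} - u = 0 \iff s = \frac{u\varepsilon}{2n},
\qquad
g\!\left(\frac{u\varepsilon}{2n}\right)
  = \frac{u^2\varepsilon}{4n} - \frac{u^2\varepsilon}{2n}
  = -\frac{u^2\varepsilon}{4n} .
```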

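Proof (of Theorem 2). We start again from (1). It remains to show that $\tilde L_{i,n}$ is
close, with large probability, to its expected value $L_{i,n}$ and that $\sum_{t=1}^{n} \tilde\ell(p_t,y_t)$ is
close to $\sum_{t=1}^{n} \ell(p_t,y_t)$.

A straightforward combination of Lemmas 1 and 2 with (1) shows that with
probability at least $1 - 3\delta/4$, the strategy asks for at most $m$ labels and has an
expected cumulative loss

$$\sum_{t=1}^{n} \ell(p_t,y_t)\Bigl(1 - \frac{\eta}{2\varepsilon}\Bigr) \le \min_{i=1,\dots,N} L_{i,n} + 4\sqrt{2}\,n\sqrt{\frac{1}{m}\ln\frac{4N}{\delta}} + \frac{\ln N}{\eta},$$

which, since $\sum_{t=1}^{n} \ell(p_t,y_t) \le n$, implies

$$\sum_{t=1}^{n} \ell(p_t,y_t) - \min_{i=1,\dots,N} L_{i,n}
\le \frac{n\eta}{2\varepsilon} + 4\sqrt{2}\,n\sqrt{\frac{1}{m}\ln\frac{4N}{\delta}} + \frac{\ln N}{\eta}
\le 2n\sqrt{\frac{\ln N}{m}} + 4\sqrt{2}\,n\sqrt{\frac{1}{m}\ln\frac{4N}{\delta}}$$

by our choice of $\eta$ and using $1/(2\varepsilon) \le n/m$ derived, once more, from our assumption
$\delta > 4N e^{-m/8}$. The proof is finished by noting that the Hoeffding-Azuma
inequality implies that, with probability at least $1 - \delta/4$,

$$\sum_{t=1}^{n} \ell(I_t,y_t) \le \sum_{t=1}^{n} \ell(p_t,y_t) + \sqrt{\frac{n}{2}\ln\frac{4}{\delta}}
\le \sum_{t=1}^{n} \ell(p_t,y_t) + n\sqrt{\frac{1}{2m}\ln\frac{4N}{\delta}}$$

since $m \le n$.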



5 A Lower Bound for Label Efficient Prediction 

Here we show that the performance bounds proved in the previous section for 
the label efficient exponentially weighted average forecaster are essentially unim- 
provable in the strong sense that no other label efficient forecasting strategy can 
have a significantly better performance for all problems. Denote the set of natural 
numbers by N = {1, 2, . . . }. 

Theorem 3. There exist an outcome space $\mathcal{Y}$, a loss function $\ell : \mathbb{N} \times \mathcal{Y} \to [0, 1]$,
and a universal constant $c > 0$ such that, for all $N \ge 2$ and for all $n \ge m \ge
20\,\frac{e}{1+e}\ln(N-1)$, the cumulative (expected) loss of any (randomized) forecaster
that uses actions in $\{1,\dots,N\}$ and asks for at most $m$ labels while predicting a
sequence of $n$ outcomes satisfies the inequality

$$\sup_{y_1,\dots,y_n \in \mathcal{Y}} \Bigl(\mathbb{E}\,\hat L_n - \min_{i=1,\dots,N} L_{i,n}\Bigr) \ge c\,n\sqrt{\frac{\ln(N-1)}{m}} .$$

In particular, we prove the theorem for $c = \dfrac{\sqrt{e}}{(1+e)\sqrt{5(1+e)}}$.

Proof. First, we define $\mathcal{Y} = [0, 1]$ and $\ell$. Given $y \in [0, 1]$, we denote by $(y_1, y_2, \dots)$
its dyadic expansion, that is, the unique sequence not ending with infinitely many
zeros such that

$$y = \sum_{k \ge 1} y_k\, 2^{-k} .$$

Now, the loss function is defined as $\ell(k, y) = y_k$ for all $y \in \mathcal{Y}$ and $k \in \mathbb{N}$.
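To make this loss concrete, the following sketch reads off the $k$-th binary digit of an outcome. It uses exact rational arithmetic and assumes $y$ is not a dyadic rational, so that the expansion is unique and the convention about trailing zeros does not matter; the function name is hypothetical.

```python
from fractions import Fraction

def dyadic_digit_loss(k, y):
    """Return l(k, y), the k-th binary digit of y in [0, 1), for k >= 1.

    Assumes y is not a dyadic rational (its binary expansion is then unique).
    """
    shifted = Fraction(y) * 2 ** k   # moves digit k just left of the binary point
    return int(shifted) % 2          # int() truncates, which is floor for y >= 0
```

For example, $1/3 = 0.010101\ldots$ in binary, so actions with even index suffer loss 1 on this outcome and actions with odd index suffer loss 0.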

We construct a random outcome sequence and show that the expected value 
of the regret (with respect both to the random choice of the outcome sequence 
and to the forecaster’s possibly random choices) for any possibly randomized 
forecaster is bounded from below by the claimed quantity. 

More precisely, we denote by $U_1, \dots, U_n$ the auxiliary randomization which
the forecaster has access to. Without loss of generality, it can be taken as an
i.i.d. sequence of uniformly distributed random variables over $[0, 1]$. Our underlying
probability space is equipped with the $\sigma$-algebra of events generated by
the random outcome sequence $Y_1, \dots, Y_n$ and by the randomization $U_1, \dots, U_n$.
The random outcome sequence is independent of the auxiliary randomization:
we define $N$ different probability distributions, $\mathbb{P}_i \otimes \mathbb{P}_A$, $i = 1, \dots, N$, formed by
the product of the auxiliary randomization (whose associated probability distribution
is denoted by $\mathbb{P}_A$) and one of the $N$ different probability distributions
$\mathbb{P}_1, \dots, \mathbb{P}_N$ over the outcome sequence defined as follows.

For $i = 1, \dots, N$, $Q_i$ is defined as the distribution (over $[0, 1]$) of

$$Z^* 2^{-i} + \sum_{k=1,\dots,N,\ k \ne i} Z_k 2^{-k} + 2^{-(N+1)} U,$$

where $U, Z^*, Z_1, \dots, Z_N$ are independent random variables such that $U$ has
uniform distribution, and $Z^*$ and the $Z_k$ have Bernoulli distribution with parameter
$1/2 - \varepsilon$ for $Z^*$ and $1/2$ for the $Z_k$. Now, the randomization is such that
under $\mathbb{P}_i$, the outcome sequence $Y_1, \dots, Y_n$ is i.i.d. with common distribution
$Q_i$.

Then, under each $\mathbb{P}_i$ (for $i = 1, \dots, N$), the losses $\ell(k, Y_t)$, $k = 1, \dots, N$,
$t = 1, \dots, n$, are i.i.d. Bernoulli random variables. In addition, $\ell(i, Y_t) = 1$ with
probability $1/2 - \varepsilon$ and $\ell(k, Y_t) = 1$ with probability $1/2$ for each $k \ne i$, where
$\varepsilon$ is a positive number specified below.
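We have

$$\max_{y_1,\dots,y_n} \Bigl(\mathbb{E}_A \hat L_n - \min_{i=1,\dots,N} L_{i,n}\Bigr)
= \max_{y_1,\dots,y_n}\ \max_{i=1,\dots,N} \bigl(\mathbb{E}_A \hat L_n - L_{i,n}\bigr)
\ge \max_{i=1,\dots,N} \mathbb{E}_i\bigl[\mathbb{E}_A \hat L_n - L_{i,n}\bigr],$$

where $\mathbb{E}_i$ (resp. $\mathbb{E}_A$) denotes expectation with respect to $\mathbb{P}_i$ (resp. $\mathbb{P}_A$).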

Now, we use the following decomposition lemma, which states that a randomized
algorithm performs, on the average, just as a convex combination of
deterministic algorithms. The simple but cumbersome proof is omitted from
this extended abstract.

Lemma 3. For any given randomized forecaster, there exists an integer $D$, a
point $\alpha = (\alpha_1, \dots, \alpha_D) \in \mathbb{R}^D$ in the probability simplex, and $D$ deterministic
algorithms (indexed by a superscript $d = 1, \dots, D$) such that, for every $t$ and
every possible outcome sequence $y_1^{t-1} = (y_1, \dots, y_{t-1})$,

$$\mathbb{P}_A\bigl[I_t = i \mid y_1^{t-1}\bigr] = \sum_{d=1}^{D} \alpha_d\, \mathbb{1}_{[I_t^d = i \mid y_1^{t-1}]},$$

where $\mathbb{1}_{[I_t^d = i \mid y_1^{t-1}]}$ is the indicator function that the $d$-th deterministic algorithm
chooses action $i$ when the sequence of past outcomes is formed by $y_1^{t-1}$.

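Using this lemma, we have that there exist $D$, $\alpha$ and $D$ deterministic subalgorithms
such that

$$\max_{i=1,\dots,N} \mathbb{E}_i\bigl[\mathbb{E}_A \hat L_n - L_{i,n}\bigr]
= \max_{i=1,\dots,N} \mathbb{E}_i\Bigl[\sum_{t=1}^{n}\sum_{d=1}^{D}\sum_{k=1}^{N} \alpha_d\, \mathbb{1}_{[I_t^d = k \mid Y_1^{t-1}]}\, \ell(k, Y_t) - L_{i,n}\Bigr]$$

$$= \max_{i=1,\dots,N} \sum_{d=1}^{D} \alpha_d\, \mathbb{E}_i\Bigl[\sum_{t=1}^{n}\sum_{k=1}^{N} \mathbb{1}_{[I_t^d = k \mid Y_1^{t-1}]}\, \ell(k, Y_t) - L_{i,n}\Bigr] .$$

Now, under $\mathbb{P}_i$ the regret grows by $\varepsilon$ whenever an action different from $i$ is chosen
and remains the same otherwise. Hence,

$$\max_{i=1,\dots,N} \mathbb{E}_i\bigl[\mathbb{E}_A \hat L_n - L_{i,n}\bigr]
= \max_{i=1,\dots,N} \sum_{d=1}^{D} \alpha_d\, \mathbb{E}_i\Bigl[\sum_{t=1}^{n}\sum_{k=1}^{N} \mathbb{1}_{[I_t^d = k \mid Y_1^{t-1}]}\, \ell(k, Y_t) - L_{i,n}\Bigr]$$

$$= \varepsilon \max_{i=1,\dots,N} \sum_{d=1}^{D}\sum_{t=1}^{n} \alpha_d\, \mathbb{P}_i\bigl[I_t^d \ne i\bigr]
= \varepsilon n \Bigl(1 - \min_{i=1,\dots,N} \sum_{d=1}^{D}\sum_{t=1}^{n} \frac{\alpha_d}{n}\, \mathbb{P}_i\bigl[I_t^d = i\bigr]\Bigr) .$$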
For the $d$-th deterministic subalgorithm, let $1 \le T_1^d < \dots < T_m^d \le n$ be the times
when the $m$ queries were issued. Then $T_1^d, \dots, T_m^d$ are finite stopping times with
respect to the i.i.d. process $Y_1, \dots, Y_n$. Hence, by a well-known fact in probability
theory (see, e.g., [5, Lemma 2, page 138]), the revealed outcomes $Y_{T_1^d}, \dots, Y_{T_m^d}$
are independent and identically distributed as $Y_1$.

Let $R_t^d$ be the number of revealed outcomes at time $t$ and note that $R_t^d$ is
measurable with respect to the random outcome sequence. Now, as the subalgorithm
we consider is deterministic, $I_t^d$ is fully determined by $Y_{T_1^d}, \dots, Y_{T^d_{R_t^d}}$.
Hence, $I_t^d$ may be seen as a function of $Y_{T_1^d}, \dots, Y_{T_m^d}$ rather than a function
of $Y_{T_1^d}, \dots, Y_{T^d_{R_t^d}}$ only. This essentially means that the knowledge of the extra
values cannot hurt in the sense that it cannot lead the forecaster to choose different
actions. As the joint distribution of $Y_{T_1^d}, \dots, Y_{T_m^d}$ under $\mathbb{P}_i$ is $Q_i^m$, we have
proven indeed that

$$\mathbb{P}_i\bigl[I_t^d = i\bigr] = Q_i^m\bigl[I_t^d = i\bigr] .$$

Consequently, our lower bound rewrites as

$$\max_{i=1,\dots,N} \mathbb{E}_i\bigl[\mathbb{E}_A \hat L_n - L_{i,n}\bigr]
= \varepsilon n \Bigl(1 - \min_{i=1,\dots,N} \sum_{d=1}^{D}\sum_{t=1}^{n} \frac{\alpha_d}{n}\, Q_i^m\bigl[I_t^d = i\bigr]\Bigr) .$$

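By the generalized Fano's lemma (see Lemma 5 in the Appendix), it is guaranteed
that

$$\min_{i=1,\dots,N} \sum_{d=1}^{D}\sum_{t=1}^{n} \frac{\alpha_d}{n}\, Q_i^m\bigl[I_t^d = i\bigr] \le \max\Bigl\{\frac{e}{1+e},\ \frac{\bar K}{\ln(N-1)}\Bigr\},$$

where

$$\bar K = \sum_{t=1}^{n}\sum_{d=1}^{D}\sum_{i=2}^{N} \frac{\alpha_d}{n(N-1)}\, \mathrm{KL}(Q_i^m, Q_1^m) = \frac{1}{N-1}\sum_{i=2}^{N} \mathrm{KL}(Q_i^m, Q_1^m)$$

and KL is the Kullback-Leibler divergence (or relative entropy) between two
probability distributions.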

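Moreover, $B_p$ denoting the Bernoulli distribution with parameter $p$,

$$\mathrm{KL}(Q_i^m, Q_1^m) = m\,\mathrm{KL}(Q_i, Q_1) \le m\bigl(\mathrm{KL}(B_{1/2-\varepsilon}, B_{1/2}) + \mathrm{KL}(B_{1/2}, B_{1/2-\varepsilon})\bigr)
= m\,\varepsilon \ln\Bigl(1 + \frac{4\varepsilon}{1 - 2\varepsilon}\Bigr) \le 5m\varepsilon^2$$

for $0 \le \varepsilon \le 1/10$, where the first inequality holds by noting that the definition of
the $Q_i$ implies that the considered Kullback-Leibler divergence is upper bounded
by the Kullback-Leibler divergence between $(Z_1, \dots, Z^*, \dots, Z_N, U)$, where $Z^*$
is in the $i$-th position, and $(Z^*, Z_2, \dots, Z_N, U)$. Therefore,

$$\max_{y_1,\dots,y_n} \Bigl(\mathbb{E}_A \hat L_n - \min_{i=1,\dots,N} L_{i,n}\Bigr) \ge \varepsilon n \Bigl(1 - \max\Bigl\{\frac{e}{1+e},\ \frac{5m\varepsilon^2}{\ln(N-1)}\Bigr\}\Bigr) .$$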
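The closed form for the symmetrised Bernoulli divergence and the bound by $5\varepsilon^2$ are easy to check numerically. The sketch below is a sanity check only (the helper names are hypothetical); it compares the sum of the two divergences with $\varepsilon\ln(1 + 4\varepsilon/(1-2\varepsilon))$.

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence KL(B_p, B_q) for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def symmetrised_kl(eps):
    """KL(B_{1/2-eps}, B_{1/2}) + KL(B_{1/2}, B_{1/2-eps})."""
    return kl_bernoulli(0.5 - eps, 0.5) + kl_bernoulli(0.5, 0.5 - eps)

def closed_form(eps):
    """The expression eps * ln(1 + 4 eps / (1 - 2 eps)) from the proof."""
    return eps * math.log(1 + 4 * eps / (1 - 2 * eps))
```

The inequality $\varepsilon\ln(1 + 4\varepsilon/(1-2\varepsilon)) \le 5\varepsilon^2$ follows from $\ln(1+x) \le x$ together with $1 - 2\varepsilon \ge 4/5$ when $\varepsilon \le 1/10$.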
Algorithm: Label efficient zero-threshold Winnow
Parameters: $\eta > 0$
Initialization: $w_{i,1} = 1$ for $i = 1, \dots, N$
For $t = 1, 2, \dots$

1. get $x_t \in \mathbb{R}^N$, define $p_t$ by $p_{i,t} = w_{i,t}/W_t$, where $W_t = \sum_{i=1}^{N} w_{i,t}$, and let
   $q_t = p_t \cdot x_t$

2. predict with $\hat y_t = \mathrm{sgn}(q_t)$

3. draw a Bernoulli variable $Z_t$ of parameter $(2|q_t|/\gamma + 1)^{-1}$

4. if $Z_t = 1$, then

   a) get $y_t \in \{-1, 1\}$

   b) if $\hat y_t \ne y_t$, then let $w_{i,t+1} = w_{i,t}\, e^{\eta\, y_t\, x_{i,t}}$ for all $i = 1, \dots, N$

5. else, $w_{i,t+1} = w_{i,t}$ for all $i = 1, \dots, N$.

Fig. 3. The randomized label-efficient zero-threshold Winnow.
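A minimal Python sketch of Figure 3 is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function name and return values are hypothetical, $\mathrm{sgn}(0)$ is taken to be $+1$, and the mistake counter uses the true labels for evaluation only (the learner itself sees a label only when it queries).

```python
import math
import random

def label_efficient_winnow(examples, eta, gamma, rng):
    """Run the algorithm of Figure 3 on (x, y) pairs with y in {-1, +1}.

    Returns (mistakes, queries, updates).
    """
    n_features = len(examples[0][0])
    weights = [1.0] * n_features
    mistakes = queries = updates = 0
    for x, y in examples:
        total = sum(weights)
        q = sum(w * xi for w, xi in zip(weights, x)) / total  # q_t = p_t . x_t
        y_hat = 1 if q >= 0 else -1                           # assumed sgn(0) = +1
        mistakes += int(y_hat != y)                           # bookkeeping only
        if rng.random() < 1.0 / (2.0 * abs(q) / gamma + 1.0): # Z_t = 1
            queries += 1
            if y_hat != y:                                    # queried mistake
                updates += 1
                weights = [w * math.exp(eta * y * xi)
                           for w, xi in zip(weights, x)]
    return mistakes, queries, updates
```

Small margins $|q_t|$ make a query almost certain, while confident predictions are rarely checked.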



The choice

$$\varepsilon = \sqrt{\frac{e\ln(N-1)}{5(1+e)\,m}}$$

yields the claimed bound.
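The effect of this choice can be verified by direct substitution. The derivation below is a worked check of the constant in Theorem 3 and of the requirement $\varepsilon \le 1/10$; it introduces no new assumptions.

```latex
\varepsilon^2 = \frac{e\ln(N-1)}{5(1+e)\,m}
\;\Longrightarrow\;
\frac{5m\varepsilon^2}{\ln(N-1)} = \frac{e}{1+e},
\qquad
\varepsilon n\Bigl(1 - \frac{e}{1+e}\Bigr) = \frac{\varepsilon n}{1+e}
  = \frac{\sqrt{e}}{(1+e)\sqrt{5(1+e)}}\; n\sqrt{\frac{\ln(N-1)}{m}},
\qquad
\varepsilon \le \frac{1}{10} \iff m \ge 20\,\frac{e}{1+e}\,\ln(N-1) .
```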



6 A Label Efficient Algorithm for Pattern Classification 

So far, we have shown that exponentially weighted average forecasters can be 
made label efficient without losing important properties, such as Hannan con- 
sistency. In this section we move away from the abstract sequential decision 
problem defined in Section 2 and show that the idea of label efficient prediction 
finds interesting applications in more concrete pattern classification problems. 
More specifically, consider the problem of predicting the binary labels of an arbitrarily
chosen sequence $x_1, x_2, \dots \in \mathbb{R}^N$ of instances where, for each $t = 1, 2, \dots$,
the label $y_t \in \{-1, 1\}$ of $x_t$ satisfies $y_t\, u \cdot x_t > 0$. Here $u \in \mathbb{R}^N$ is a fixed but unknown
linear separator for the labeled sequence. In this framework, we show that
the zero-threshold Winnow algorithm of Littlestone [10], a natural extension to 
pattern classification of the exponentially weighted average forecaster, can be 
made label efficient. In particular, for the label efficient variant of this algorithm 
(described in Figure 3) we prove an expected mistake bound exactly equal to 
the mistake bound of the original zero-threshold Winnow. In addition, unlike the 
algorithms shown in previous sections, in our variant the probability of querying 
a label is a function of the previously observed instances and previously queried 
labels. 

Theorem 4. Pick any sequence $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^N \times \{-1, 1\}$ such that,
for all $t = 1, \dots, n$, $y_t\, u \cdot x_t \ge \gamma$ for some $\gamma > 0$ and some vector $u$ from
the probability simplex in $\mathbb{R}^N$. Let $X_\infty$ be any number such that $\max_t \|x_t\|_\infty \le X_\infty$.
Then the randomized label efficient zero-threshold Winnow algorithm of
Figure 3, run with parameter $\eta = \gamma/X_\infty^2$, makes an expected number of mistakes
bounded by $2\,(X_\infty^2/\gamma^2)\ln N$ while querying an expected number of labels equal to
$\sum_{t=1}^{n} \mathbb{E}\,(2|q_t|/\gamma + 1)^{-1}$.

The dependence of $\eta$ on $\gamma$ is inherited from the original Winnow algorithm and
is not caused by the label efficient framework. Note also that, while the expected
mistake bound is the same as the mistake bound for the original zero-threshold
Winnow, the probability of querying a label at step $t$ attains 1 as the "margin"
$|q_t|$ shrinks to 0, and attains $(2X_\infty/\gamma + 1)^{-1}$ as $|q_t|$ grows to its maximum value
$X_\infty$. Obtaining an explicit bound on the expected number of queried labels
appears hard as qt depends in a complicated way on the structure of the labeled 
sequence. Hence, the result demonstrates that the label efficient framework in 
this case does provide an advantage (in expectation), even though the theoretical 
assessment of this advantage appears to be problematic. 

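Proof. Let $M_t$ be the indicator function for a mistake in step $t$. Pick a step $t$
such that $M_t$ and $Z_t$ are both 1. Then,

$$\ln\frac{W_{t+1}}{W_t} = \ln\sum_{i=1}^{N} p_{i,t}\, e^{\eta\, y_t\, x_{i,t}} \le \eta\, y_t\, p_t \cdot x_t + \frac{\eta^2}{2} X_\infty^2 = -\eta\,|q_t| + \frac{\eta^2}{2} X_\infty^2,$$

where the inequality is an application of the Hoeffding inequality [9] while the
last equality holds because $M_t = 1$ implies $y_t q_t \le 0$. On the other hand, if $M_t$
or $Z_t$ is 0 at step $t$, then $W_{t+1} = W_t$ and thus $\ln(W_{t+1}/W_t) = 0$. Summing for
$t = 1, \dots, n$ we get

$$\ln\frac{W_{n+1}}{W_1} \le \eta \sum_{t=1}^{n} \Bigl(\frac{\eta}{2} X_\infty^2 - |q_t|\Bigr) M_t Z_t . \qquad (2)$$

Now consider any vector $u$ of convex coefficients such that $y_t\, u \cdot x_t \ge \gamma$ for all
$t = 1, \dots, n$. Let

$$R = \sum_{t=1}^{n} (y_t\, x_t)\, M_t Z_t .$$

Using the log-sum inequality [6], and recalling that $y_t\, u \cdot x_t \ge \gamma$ for all $t$,

$$\ln\frac{W_{n+1}}{W_1} = -\ln N + \ln\sum_{i=1}^{N} e^{\eta R_i} \ge -\ln N + \eta\, R \cdot u + H(u)
\ge -\ln N + \eta\gamma \sum_{t=1}^{n} M_t Z_t + H(u) . \qquad (3)$$

Dropping $H(u) \ge 0$, the entropy of $u$, from (2) and (3) we obtain

$$-\ln N + \eta\gamma \sum_{t=1}^{n} M_t Z_t \le \eta \sum_{t=1}^{n} \Bigl(\frac{\eta}{2} X_\infty^2 - |q_t|\Bigr) M_t Z_t .$$

Dividing by $\eta > 0$ and rearranging yields

$$\sum_{t=1}^{n} \Bigl(\gamma + |q_t| - \frac{\eta}{2} X_\infty^2\Bigr) M_t Z_t \le \frac{\ln N}{\eta} .$$

Replacing $\eta$ with $\gamma/X_\infty^2$ gets us

$$\sum_{t=1}^{n} \Bigl(\frac{\gamma}{2} + |q_t|\Bigr) M_t Z_t \le \frac{X_\infty^2}{\gamma}\ln N . \qquad (4)$$

Now recall that $\mathbb{E}[Z_t \mid Z_1, \dots, Z_{t-1}] = (2|q_t|/\gamma + 1)^{-1}$, where the conditioning
is needed as $q_t$ is a function of $Z_1, \dots, Z_{t-1}$. Taking expectation on both sides
of (4) yields

$$\frac{X_\infty^2}{\gamma}\ln N \ge \mathbb{E}\Bigl[\sum_{t=1}^{n} \Bigl(\frac{\gamma}{2} + |q_t|\Bigr) M_t Z_t\Bigr]
= \mathbb{E}\Bigl[\sum_{t=1}^{n} \Bigl(\frac{\gamma}{2} + |q_t|\Bigr) M_t\, \mathbb{E}[Z_t \mid Z_1, \dots, Z_{t-1}]\Bigr]$$

$$= \mathbb{E}\Bigl[\sum_{t=1}^{n} \Bigl(\frac{\gamma}{2} + |q_t|\Bigr) \frac{M_t}{2|q_t|/\gamma + 1}\Bigr]
= \frac{\gamma}{2}\, \mathbb{E}\Bigl[\sum_{t=1}^{n} M_t\Bigr] .$$

Multiplying both sides by $2/\gamma$ gets us the desired result.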
A Technical Lemmas

The crucial point in the proof of the lower bound theorem is an extension of 
Fano’s lemma to a convex combination of probability masses, which may be 
proved thanks to a straightforward modification of the techniques developed by 
Birge [3] (see also Massart [11]). Recall first a consequence of the variational 
formula for entropy. 

Lemma 4. For arbitrary probability distributions $P$, $Q$, any event $A$, and each $\lambda > 0$,

$$\lambda P[A] - \psi_{Q[A]}(\lambda) \le \mathrm{KL}(P, Q),$$

where $\psi_p(\lambda) = \ln\bigl(p\,(e^{\lambda} - 1) + 1\bigr)$.
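Lemma 4 can be checked numerically on a small finite example. The sketch below is illustrative only: the helper names are hypothetical, distributions are given as lists of probabilities over a finite set, and the event is a set of indices.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def psi(p, lam):
    """psi_p(lambda) = ln(p (e^lambda - 1) + 1)."""
    return math.log(p * (math.exp(lam) - 1.0) + 1.0)

def lemma4_gap(p, q, event, lam):
    """KL(P, Q) - (lam * P[A] - psi_{Q[A]}(lam)); Lemma 4 says this is >= 0."""
    p_a = sum(p[i] for i in event)
    q_a = sum(q[i] for i in event)
    return kl(p, q) - (lam * p_a - psi(q_a, lam))
```

The inequality is the variational (Donsker-Varadhan) formula applied to the function $\lambda \mathbb{1}_A$.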



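Lemma 5 (Generalized Fano). Let $\{A_{s,j} : s = 1, \dots, S,\ j = 1, \dots, N\}$ be a
family of subsets of a set $\Omega$ such that $A_{s,1}, \dots, A_{s,N}$ form a partition of $\Omega$ for
each fixed $s$. Let $\alpha_1, \dots, \alpha_S$ be such that $\alpha_s \ge 0$ for $s = 1, \dots, S$ and $\alpha_1 +
\dots + \alpha_S = 1$. Then, for all sets $P_{s,1}, \dots, P_{s,N}$, $s = 1, \dots, S$, of probability
distributions on $\Omega$,

$$\min_{j=1,\dots,N} \sum_{s=1}^{S} \alpha_s P_{s,j}[A_{s,j}] \le \max\Bigl\{\frac{e}{1+e},\ \frac{\bar K}{\ln(N-1)}\Bigr\},$$

where

$$\bar K = \sum_{s=1}^{S}\sum_{j=2}^{N} \frac{\alpha_s}{N-1}\, \mathrm{KL}(P_{s,j}, P_{s,1}) .$$

Proof. Using Lemma 4, we have that

$$\sum_{s=1}^{S}\sum_{j=2}^{N} \frac{\alpha_s}{N-1}\, \lambda P_{s,j}[A_{s,j}] - \sum_{s=1}^{S}\sum_{j=2}^{N} \frac{\alpha_s}{N-1}\, \psi_{P_{s,1}[A_{s,j}]}(\lambda)
\le \sum_{s=1}^{S}\sum_{j=2}^{N} \frac{\alpha_s}{N-1}\, \mathrm{KL}(P_{s,j}, P_{s,1}) = \bar K .$$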
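Now, for each fixed $\lambda > 0$, the function that maps $p$ to $-\psi_p(\lambda)$ is convex. Hence,
letting

$$p_1 = \sum_{s=1}^{S}\sum_{j=2}^{N} \frac{\alpha_s}{N-1}\, P_{s,1}[A_{s,j}] = \frac{1}{N-1}\Bigl(1 - \sum_{s=1}^{S} \alpha_s P_{s,1}[A_{s,1}]\Bigr),$$

by Jensen's inequality we get

$$\sum_{s=1}^{S}\sum_{j=2}^{N} \frac{\alpha_s}{N-1}\, \lambda P_{s,j}[A_{s,j}] - \psi_{p_1}(\lambda)
\le \sum_{s=1}^{S}\sum_{j=2}^{N} \frac{\alpha_s}{N-1}\, \lambda P_{s,j}[A_{s,j}] - \sum_{s=1}^{S}\sum_{j=2}^{N} \frac{\alpha_s}{N-1}\, \psi_{P_{s,1}[A_{s,j}]}(\lambda) .$$

Recalling that the right-hand side of the above inequality is less than $\bar K$,
and introducing the quantities

$$a_j = \sum_{s=1}^{S} \alpha_s P_{s,j}[A_{s,j}] \quad \text{for } j = 1, \dots, N,$$

we conclude

$$\lambda \min_{j=1,\dots,N} a_j - \psi_{p_1}(\lambda) \le \lambda\, \frac{1}{N-1}\sum_{j=2}^{N} a_j - \psi_{p_1}(\lambda) \le \bar K, \qquad p_1 = \frac{1 - a_1}{N-1} .$$

Denote by $a$ the minimum of the $a_j$'s and let $p^* = (1 - a)/(N - 1) \ge p_1$. We
only have to deal with the case when $a \ge e/(1 + e)$. As for all $\lambda > 0$, the function
that maps $p$ to $-\psi_p(\lambda)$ is decreasing, we have

$$\bar K \ge \sup_{\lambda > 0} \bigl(\lambda a - \psi_{p^*}(\lambda)\bigr) \ge a \ln\frac{a}{e\,p^*} = a \ln\frac{a\,(N-1)}{(1-a)\,e} \ge a \ln(N - 1),$$

whenever $p^* \le a \le 1$ for the second inequality to hold, and by using $a \ge e/(1+e)$
for the last one. As $p^* \le 1/(N - 1) \le e/(1 + e)$ whenever $N \ge 3$, the case $a < p^*$
may only happen when $N = 2$, but then the result is trivial.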
Abstract. We study the problem of classifying data in a given taxonomy
when classifications associated with multiple and/or partial paths are al- 
lowed. We introduce an incremental algorithm using a linear-threshold 
classifier at each node of the taxonomy. These classifiers are trained and 
evaluated in a hierarchical top-down fashion. We then define a hierarchical
and parametric data model and prove a bound on the probability
that our algorithm guesses the wrong multilabel for a random instance 
compared to the same probability when the true model parameters are 
known. Our bound decreases exponentially with the number of train- 
ing examples and depends in a detailed way on the interaction between 
the process parameters and the taxonomy structure. Preliminary exper- 
iments on real-world data provide support to our theoretical results. 



1 Introduction 

In this paper, we investigate the problem of classifying data based on the knowl- 
edge that the graph of dependencies between class elements is a tree forest. The 
trees in this forest are collectively interpreted as a taxonomy. That is, we as- 
sume that every data instance is labelled with a (possibly empty) set of class 
labels and, whenever an instance is labelled with a certain label i, then it is also 
labelled with all the labels on the path from the root of the tree where i occurs 
down to node i. We also allow multiple-path labellings (instances can be tagged 
with labels belonging to more than one path in the forest), and partial-path 
labellings (instances can be tagged with labels belonging to a path that does not 
end on a leaf) . 

The problem of hierarchical classification, especially of textual information, 
has been extensively investigated in past years (see, e.g., [5,6,7,11,12,13,17, 
19] and references therein). Whereas the use of hierarchically trained linear- 
threshold classifiers is common to several of these previous approaches, to our 

* The first and third author gratefully acknowledge partial support by the PASCAL 
Network of Excellence under EC grant no. 506778. 
knowledge our research is the first one to provide a rigorous performance analysis
of the hierarchical classification problem in the presence of multiple and partial
path classifications.

Following a standard approach in statistical learning theory, we assume that 
data are generated by a parametric and hierarchical stochastic process associ- 
ated with the given taxonomy. Building on the techniques from [3], we design 
and analyze an algorithm for estimating the parameters of the process. Our al- 
gorithm is based on a hierarchy of regularized least-squares estimators which are 
incrementally updated as more data flow into the system. We prove bounds on 
the instantaneous regret; that is, we bound the probability that, after observing 
any number t of examples, our algorithm guesses the wrong multilabel on the 
next randomly drawn data element, while the hierarchical classifier knowing the 
true parameters of the process predicts the correct multilabel. Our main con- 
cern in this analysis is stressing the interaction between the taxonomy structure 
and the process generating the examples. This is in contrast with the standard 
approach in the literature about regret bounds, where a major attention is paid 
to studying how the regret depends on time. 

To support our theoretical findings, we also briefly describe some experi- 
ments concerning a more practical variant of the algorithm we actually analyze. 
Though these experiments are preliminary in nature, their outcomes are fairly 
encouraging. 

The paper is organized as follows. In Section 2 we introduce our learning 
model, along with the notational conventions used throughout the paper. Our 
hierarchical algorithm is described in Section 3 and analyzed in Section 4. In 
Section 5 we briefly report on the experiments. Finally, in Section 6 we summarize 
and mention future lines of research. 



2 Learning Model and Notation 

We assume data elements are encoded as real vectors $x \in \mathbb{R}^d$ which we call
instances. A multilabel for an instance $x$ is any subset of the set $\{1, \dots, c\}$ of all
labels, including the empty set. We represent the multilabel of $x$ with a vector
$v = (v_1, \dots, v_c) \in \{-1, 1\}^c$, where $i$ belongs to the multilabel of $x$ if and only if
$v_i = 1$. A taxonomy $G$ is a forest whose trees are defined over the set of labels.
We use $j = \mathrm{par}(i)$ to denote the unique parent of $i$ and $\mathrm{anc}(i)$ to denote the
set of ancestors of $i$. The depth of a node $i$ (number of edges on the path from
the root to $i$) is denoted by $h_i$.

A multilabel $v$ belongs to a given taxonomy if and only if it is the union of one
or more paths in the forest, where each path must start from a root but need not
terminate on a leaf (see Figure 1). A probability distribution $f_G$ over the set of
multilabels is associated to a taxonomy $G$ as follows. Each node $i$ of $G$ is tagged
with a $\{-1, 1\}$-valued random variable $V_i$ distributed according to a conditional
probability function $\mathbb{P}(V_i \mid V_{\mathrm{par}(i)}, X)$. To model the dependency between the
labels of nodes $i$ and $j = \mathrm{par}(i)$ we assume $\mathbb{P}(V_i = 1 \mid V_j = -1, X = x) = 0$
Fig. 1. A forest made up of two disjoint trees. The nodes are tagged with the
name of the labels, so that in this case $c = 11$. According to our definition, the
multilabel $v = (1, 1, 1, -1, -1, 1, -1, 1, -1, 1, -1)$ belongs to this taxonomy (since it
is the union of paths $1 \to 2$, $1 \to 3$ and $6 \to 8 \to 10$), while the multilabel
$v = (1, 1, -1, 1, -1, -1, -1, -1, -1, -1, -1)$ does not, since $1 \to 2 \to 4$ is not a path in
the forest.

for all nonroot nodes $i$ and all instances $x$. For example, in the taxonomy of
Figure 1 we have $\mathbb{P}(V_4 = 1 \mid V_3 = -1, X = x) = 0$ for all $x \in \mathbb{R}^d$. The quantity

$$f_G(v \mid x) = \prod_{i=1}^{c} \mathbb{P}\bigl(V_i = v_i \mid V_j = v_j,\ j = \mathrm{par}(i),\ X = x\bigr)$$

thus defines a joint probability distribution on $V_1, \dots, V_c$ conditioned on $x$ being
the current instance.
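The membership condition for multilabels (a union of paths that each start at a root) is equivalent to requiring that every positively labelled node has a positively labelled parent. The sketch below checks this on a hypothetical 11-node forest; the parent map is an assumption chosen only to be consistent with the paths mentioned in the caption of Figure 1, since the figure itself is not reproduced here.

```python
# Hypothetical parent map (None marks a root); NOT taken from the paper's figure,
# only chosen so that 1->2, 1->3, 6->8->10 are paths and par(4) = 3.
PARENT = {1: None, 2: 1, 3: 1, 4: 3, 5: 3,
          6: None, 7: 6, 8: 6, 9: 8, 10: 8, 11: 8}

def belongs_to_taxonomy(v, parent):
    """Return True if the +1/-1 vector v (labels 1..c) is a union of root paths."""
    for node, par in parent.items():
        if v[node - 1] == 1 and par is not None and v[par - 1] != 1:
            return False   # a positive node below a negative parent
    return True
```

On the two multilabels of the caption the check returns `True` and `False` respectively, the second one failing because node 4 is positive while its parent 3 is negative.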

Through $f_G$ we specify an i.i.d. process $\{(X_1, V_1), (X_2, V_2), \dots\}$, where, for
$t = 1, 2, \dots$, the multilabel $V_t$ is distributed according to the joint distribution
$f_G(\cdot \mid X_t)$ and $X_t$ is distributed according to a fixed and unknown distribution
$D$. We call each realization $(x_t, v_t)$ of $(X_t, V_t)$ an example.

We now introduce a parametric model for $f_G$. First, we assume that the
support of $D$ is the surface of the $d$-dimensional unit sphere (in other words,
instances $x \in \mathbb{R}^d$ are normalized, so that $\|x\| = 1$). With each node $i$ in the
taxonomy, we associate a unit-norm weight vector $u_i \in \mathbb{R}^d$. Then, we define the
conditional probabilities for a nonroot node $i$ with parent $j$ as follows:

$$\mathbb{P}(V_i = 1 \mid V_j = 1, X = x) = (1 + u_i^\top x)/2 . \qquad (1)$$

If $i$ is a root node, the above simplifies to

$$\mathbb{P}(V_i = 1 \mid X = x) = (1 + u_i^\top x)/2 .$$

Note that, in this model, the labels of the children of any given node are independent
random variables. This is motivated by the fact that, unlike previous
investigations, we are explicitly modelling labellings involving multiple paths.
A more sophisticated analysis could introduce an arbitrary negative correlation 
between the labels of the children nodes. We did not attempt to follow this route. 

In this parametric model, we would like to perform almost as well as the
hierarchical predictor that knows all vectors $u_1, \dots, u_c$ and labels an instance $x$
with the multilabel $y = (y_1, \dots, y_c)$ computed in the following natural top-down
fashion:^1

$$y_i = \begin{cases}
\mathrm{SGN}(u_i^\top x) & \text{if } i \text{ is a root node,} \\
\mathrm{SGN}(u_i^\top x) & \text{if } i \text{ is not a root and } y_j = +1 \text{ for } j = \mathrm{par}(i), \\
-1 & \text{if } i \text{ is not a root and } y_j = -1 \text{ for } j = \mathrm{par}(i).
\end{cases} \qquad (2)$$

In other words, if a node has been labelled $+1$ then each child is labelled according
to a linear-threshold function. On the other hand, if a node happens to be
labelled $-1$ then all of its descendants are labelled $-1$.

^1 SGN denotes the usual signum function: $\mathrm{SGN}(x) = 1$ if $x \ge 0$; $-1$, otherwise.
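The top-down rule (2) translates directly into a short recursive routine. The sketch below is an illustration, not the authors' code: the function names and the dictionary-based representation of the taxonomy and of the weight vectors are assumptions, and SGN is taken to return $+1$ at zero.

```python
def sgn(z):
    """Signum with the convention SGN(0) = +1 (an assumption of this sketch)."""
    return 1 if z >= 0 else -1

def hierarchical_predict(x, weights, parent):
    """Label every node top-down as in (2).

    weights maps node -> weight vector, parent maps node -> parent (None for roots).
    Returns the labels ordered by node index.
    """
    labels = {}

    def label(i):
        if i not in labels:
            j = parent[i]
            if j is not None and label(j) == -1:
                labels[i] = -1   # a negative parent forces -1 on its subtree
            else:
                labels[i] = sgn(sum(u * xi for u, xi in zip(weights[i], x)))
        return labels[i]

    return [label(i) for i in sorted(parent)]
```

Note how a node whose own linear score is positive is still labelled $-1$ when its parent is negative, so the output always belongs to the taxonomy.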

For our theoretical analysis, we consider the following on-line learning model. 
In the generic time step t = 1,2,... the algorithm receives an instance x^ (a 
realization of Xt) and outputs c binary predictions y 2 ,t, • ■ • , ilc,t G {~1) +1}> 
one for each node in the taxonomy. These predictions are viewed as guesses for 
the true labels v\^t, V 2 ,t, ■ ■ ■ , Vc,t (realizations of Vi^t, V 2 ,t, ■ ■ ■ , Vc,t, respectively) 
associated with Xt. After each prediction, the algorithm observes the true labels 
and updates its estimates of the true model parameters. Such estimates will then 
be used in the next time step. 

In a hierarchical classification framework many reasonable accuracy measures 
can be defined. As an attempt to be as fair as possible,^ we measure the accuracy 
of our algorithm through its global instantaneous regret on instance $X_t$, 

$$
\mathbb{P}(\exists i : \hat y_{i,t} \ne V_{i,t}) - \mathbb{P}(\exists i : y_{i,t} \ne V_{i,t}) ,
$$

where $y_{i,t}$ is the $i$-th label output at time $t$ by the reference predictor (2). The 
above probabilities are w.r.t. the random draw of $(X_1, V_1), \dots, (X_t, V_t)$. The 
regret bounds we prove in Section 4 are shown to depend on the interaction 
between the structure of the multi-dimensional data-generating process and the 
structure of the taxonomy on which the process is applied. 

Further notation. We denote by $\{\phi\}$ the Bernoulli random variable which is 1 
if and only if predicate $\phi$ is true. Let $\psi$ be another predicate. We repeatedly 
use simple facts such as $\{\phi \vee \psi\} = \{\phi\} + \{\psi, \neg\phi\} \le \{\phi\} + \{\psi\}$ and 
$\{\phi\} = \{\phi \wedge \psi\} + \{\phi \wedge \neg\psi\} \le \{\phi \wedge \psi\} + \{\neg\psi\}$. 

3 The Learning Algorithm 

We consider linear-threshold algorithms operating on each node of the taxonomy. 
The algorithm sitting on node $i$ maintains and adjusts a weight vector $W_{i,t}$ 
which represents an estimate at time $t$ of the corresponding unknown vector $u_i$. 

Our hierarchical classification algorithm combines the weight vectors $W_{i,t}$ 
associated to each node in much the same way as the hierarchical predictor (2). 
However, since $u_i$ parameterizes a conditional distribution where the label 
associated with the parent of node $i$ is 1 (recall (1)), it is natural to update $W_{i,t}$ 
only when such a conditioning event actually occurs. The pseudocode of our 
algorithm is given in Figure 2. 

^ It is worth mentioning that the machinery developed in this paper could also be 
used to analyze loss functions more sophisticated than the 0-1 loss. However, we will 
not pursue this more sophisticated analysis here. 
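Initialization: Weight vectors $W_{i,1} = (0, \dots, 0)$, $i = 1, \dots, c$. 
For $t = 1, 2, \dots$ do: 

1. Observe instance $X_t$; 

2. Compute prediction values $\hat y_{i,t} \in \{-1, 1\}$ as follows: 

$$
\hat y_{i,t} = \begin{cases}
\mathrm{SGN}(W_{i,t}^\top X_t) & \text{if $i$ is a root node,} \\
\mathrm{SGN}(W_{i,t}^\top X_t) & \text{if $i$ is not a root node and $\hat y_{j,t} = +1$ for $j = \mathrm{PAR}(i)$,} \\
-1 & \text{if $i$ is not a root node and $\hat y_{j,t} = -1$ for $j = \mathrm{PAR}(i)$,}
\end{cases}
$$

where 

$$
\begin{aligned}
W_{i,t} &= \left(I + S_{i,t-1} S_{i,t-1}^\top + X_t X_t^\top\right)^{-1} S_{i,t-1} V_{i,t-1} , \\
S_{i,t-1} &= \left[ X_{i_1} \; X_{i_2} \; \cdots \; X_{i_{N(i,t-1)}} \right] , \\
V_{i,t-1} &= \left( V_{i,i_1}, V_{i,i_2}, \dots, V_{i,i_{N(i,t-1)}} \right)^\top , \qquad i = 1, \dots, c ;
\end{aligned}
$$

3. Observe multilabel $V_t$ and perform update. 

Fig. 2. The hierarchical learning algorithm. 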



Given the i.i.d. process $X_1, X_2, \dots$ generating the instances, for each node $i$ 
we define the derived process $X_{i_1}, X_{i_2}, \dots$ including all and only the instances 
$X_s$ of the original process that satisfy $V_{\mathrm{PAR}(i),s} = 1$. We call this derived process 
the process at node $i$. Note that, for each $i$, the process at node $i$ is an i.i.d. 
process. However, its distribution might depend on $i$; that is, the process 
distribution at node $i$ is generally different from the process distribution at node 
$j \ne i$. 

Let $N(i,t)$ denote the number of times the parent of node $i$ observes a positive 
label up to time $t$, i.e., $N(i,t) = |\{1 \le s \le t : V_{\mathrm{PAR}(i),s} = 1\}|$. The weight 
vector $W_{i,t}$ stored at time $t$ in node $i$ is a (conditional) regularized least squares 
estimator given by 

$$
W_{i,t} = \left(I + S_{i,t-1} S_{i,t-1}^\top + X_t X_t^\top\right)^{-1} S_{i,t-1} V_{i,t-1}
\tag{3}
$$

where $I$ is the $d \times d$ identity matrix, $S_{i,t-1}$ is the $d \times N(i,t-1)$ matrix 
whose columns are the instances $X_{i_1}, \dots, X_{i_{N(i,t-1)}}$, and 
$V_{i,t-1} = (V_{i,i_1}, \dots, V_{i,i_{N(i,t-1)}})^\top$ is the $N(i,t-1)$-dimensional vector of the 
corresponding labels observed by node $i$. 

This estimator is a slight variant of regularized least squares for classification 
[2,15] where we include the current instance $X_t$ in the computation of $W_{i,t}$ 
(see, e.g., [1,20] for analyses of similar algorithms in different contexts). Efficient 
incremental computations of the inverse matrix and dual variable formulations 
of the algorithm are extensively discussed in [2,15]. 
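The algorithm of Figure 2 together with the estimator (3) can be sketched as follows. This is only an illustrative reading: the class name, the `parent` encoding (`None` for a root, parents numbered before children) and the plain Gaussian-elimination solver are assumptions made here to keep the example dependency-free. It re-solves the linear system at every prediction rather than using the incremental inverse updates discussed in [2,15].

```python
from typing import List, Optional, Sequence

Vector = List[float]
Matrix = List[List[float]]


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]   # augmented copy
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


class HierarchicalRLS:
    """One estimator of the form (3) per node, combined top-down as in Figure 2."""

    def __init__(self, parent: Sequence[Optional[int]], d: int):
        self.parent = list(parent)   # parent[i] < i, None for roots (assumed encoding)
        self.d = d
        # gram[i] = I + S_i S_i^T and corr[i] = S_i V_i for the examples stored at node i
        self.gram = [[[float(r == c) for c in range(d)] for r in range(d)]
                     for _ in self.parent]
        self.corr = [[0.0] * d for _ in self.parent]

    def weights(self, i: int, x: Vector) -> Vector:
        """W_{i,t}: the current instance x enters the matrix, as in (3)."""
        a = [[self.gram[i][r][c] + x[r] * x[c] for c in range(self.d)]
             for r in range(self.d)]
        return solve(a, self.corr[i])

    def predict(self, x: Vector) -> List[int]:
        y: List[int] = []
        for i, par in enumerate(self.parent):
            if par is not None and y[par] == -1:
                y.append(-1)
            else:
                margin = sum(w * xk for w, xk in zip(self.weights(i, x), x))
                y.append(1 if margin > 0 else -1)
        return y

    def update(self, x: Vector, v: Sequence[int]) -> None:
        """Store (x, v[i]) at node i only when the parent's true label is +1."""
        for i, par in enumerate(self.parent):
            if par is None or v[par] == 1:
                for r in range(self.d):
                    self.corr[i][r] += v[i] * x[r]
                    for c in range(self.d):
                        self.gram[i][r][c] += x[r] * x[c]


model = HierarchicalRLS(parent=[None, 0], d=2)   # hypothetical two-node chain
model.update([1.0, 0.0], [1, 1])
model.update([0.0, 1.0], [-1, -1])               # not stored at node 1: its parent is -1
print(model.predict([1.0, 0.0]), model.predict([0.0, 1.0]))
```

Keeping $I + S S^\top$ and $S V$ per node, instead of the raw examples, is a design choice of this sketch; it makes the conditional update rule (store only when the parent's label is $+1$) a two-line accumulation.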



4 Analysis 

In this section we state and prove our main result, a bound on the regret of our 
hierarchical classification algorithm. In essence, the analysis hinges on proving 
that for any node $i$, the estimated margin $W_{i,t}^\top X_t$ is an asymptotically unbiased 
estimator of the true margin $u_i^\top X_t$, and then on using known large deviation 
arguments to obtain the stated bound. For this purpose, we bound the variance 
of the margin estimator at each node and prove a bound on the rate at which 
the bias vanishes. Both bounds will crucially depend on the convergence of the 
smallest empirical eigenvalue of the process at each node $i$, and the next result 
is the key to keeping this convergence under control. 

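Lemma 1 (Shawe-Taylor et al. [18]). Let $\mathbf{X} = (X_1, \dots, X_d) \in \mathbb{R}^d$ be a 
random vector such that $\|\mathbf{X}\| = 1$ with probability 1, and let $\lambda \ge 0$ be the 
smallest eigenvalue of the correlation matrix $\{\mathbb{E}[X_i X_j]\}_{i,j=1}^{d}$. If $\mathbf{X}_1, \dots, \mathbf{X}_s$ 
are i.i.d. random vectors distributed as $\mathbf{X}$, $S$ is the $d \times s$ matrix whose columns 
are $\mathbf{X}_1, \dots, \mathbf{X}_s$, $C = \frac{1}{s} S S^\top$ is the associated empirical correlation matrix, and 
$\hat\lambda_s \ge 0$ is the smallest eigenvalue of $C$, then 

$$
\mathbb{P}\left(\hat\lambda_s < \lambda/2\right) \le 2(s+1)\, e^{-s\lambda^2/304} \quad \text{provided } s \ge 96 d/\lambda^2 .
\tag{4}
$$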

We now state our main result. 

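Theorem 1. Consider a taxonomy $G$ with $c$ nodes of depths $h_1, \dots, h_c$ and 
fix an arbitrary choice of parameters $u_1, \dots, u_c \in \mathbb{R}^d$ such that $\|u_i\| = 1$, 
$i = 1, \dots, c$. Assume there exist $\gamma_1, \dots, \gamma_c > 0$ such that distribution $D$ satisfies 
$\mathbb{P}\left(|u_i^\top X_t| \ge \gamma_i\right) = 1$, $i = 1, \dots, c$. Then, for all 

$$
t > \max\left\{ \max_{i=1,\dots,c} \frac{8 \cdot 2^{h_i+1}}{\mathbb{P}(\mathcal{A}_{i,t})\, \lambda_i\, \gamma_i} ,\; \max_{i=1,\dots,c} \frac{96 d \cdot 2^{h_i+1}}{\mathbb{P}(\mathcal{A}_{i,t})\, \lambda_i^2} \right\}
$$

the regret at time $t$ of the algorithm described in Figure 2 satisfies 

$$
\begin{aligned}
&\mathbb{P}(\exists i : \hat y_{i,t} \ne V_{i,t}) - \mathbb{P}(\exists i : y_{i,t} \ne V_{i,t}) \\
&\quad \le \sum_{i=1}^{c} \mathbb{P}(\mathcal{A}_{i,t}) \left[ 2\,e\,t \exp\left( -\frac{\gamma_i^2 \lambda_i (t-1)\, \mathbb{P}(\mathcal{A}_{i,t})}{16 \cdot 2^{h_i+1}} \right)
+ e\,(t+1)^2 \exp\left( -\frac{\lambda_i^2 (t-1)\, \mathbb{P}(\mathcal{A}_{i,t})}{304 \cdot 2^{h_i+1}} \right)
+ \exp\left( -\frac{(t-1)\, \mathbb{P}(\mathcal{A}_{i,t})}{5 \cdot 2^{h_i+1}} \right) \right] ,
\end{aligned}
$$

where $\mathcal{A}_{i,t} = \{\forall j \in \mathrm{ANC}(i) : u_j^\top X_t > 0\}$ and $\lambda_i$ is the smallest eigenvalue of 
the process at node $i$. 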

Remark 1. Note that the dependence of $\mathbb{P}(\mathcal{A}_{i,t})$ on $t$ is purely formal, as evinced 
by the definition of $\mathcal{A}_{i,t}$. Hence, the regret vanishes exponentially in $t$. This 
unnaturally fast rate is mainly caused by our assumptions on the data and, 
in particular, on the existence of $\gamma_1, \dots, \gamma_c$ constraining the support of $D$. As 
shown in [3], we would recover the standard rate by assuming, instead, some 
reasonable bound on the tail of the distribution of the inverse squared margin 
$(u_i^\top X_t)^{-2}$, though this would make our analysis somewhat more complicated. 
Remark 2. The values $\mathbb{P}(\mathcal{A}_{i,t})/2^{h_i+1}$ express the main interplay between the 
taxonomy structure and the process generating the examples. It is important to 
observe how our regret bound depends on such quantities. For instance, if we 
just focus on the probability values $\mathbb{P}(\mathcal{A}_{i,t})$, we see that the regret bound is 
essentially the sum over all nodes $i$ in the taxonomy of terms of the form 

$$
\mathbb{P}(\mathcal{A}_{i,t}) \exp\left(-k_i\, \mathbb{P}(\mathcal{A}_{i,t})\, t\right) ,
\tag{5}
$$

where the $k_i$'s are positive constants. Clearly, $\mathbb{P}(\mathcal{A}_{i,t})$ decreases as we descend 
along a path. Hence, if node $i$ is a root then $\mathbb{P}(\mathcal{A}_{i,t})$ tends to be relatively large, 
whereas if $i$ is a leaf node then $\mathbb{P}(\mathcal{A}_{i,t})$ tends to be close to zero. In both cases 
(5) tends to be small: when $\mathbb{P}(\mathcal{A}_{i,t})$ is close to one it does not affect the negative 
exponential decrease with time; on the other hand, if $\mathbb{P}(\mathcal{A}_{i,t})$ is close to zero then 
(5) is small anyway. In fact, this is no surprise, since it is a direct consequence of 
the hierarchical nature of our prediction algorithm (Figure 2). Let us consider, 
for the sake of clarity, two extreme cases: 1) $i$ is a root node; 2) $i$ is a (very deep) 
leaf node. 

1) A root node observes all instances. The predictor at this node is required 
to predict through $\mathrm{SGN}(W_{i,t}^\top X_t)$ on all instances $X_t$, but the estimator $W_{i,t}$ 
gets close to $u_i$ very quickly. In this case the negative exponential convergence 
of the associated term (5) is fast ($\mathbb{P}(\mathcal{A}_{i,t})$ is "large"). 

2) A leaf node observes a possibly small subset of the instances, but it is 
also required to produce only a small subset of linear-threshold predictions (the 
associated weight vector $W_{i,t}$ might be an unreliable estimator, but is also used 
less often). Therefore, in this case, (5) is small just because so is $\mathbb{P}(\mathcal{A}_{i,t})$. 

In summary, $\mathbb{P}(\mathcal{A}_{i,t})$ somehow measures both the rate at which the estimator 
in node $i$ gets updated and the relative importance of the accuracy of this 
estimator when computing the overall regret. 
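The behaviour of term (5) at the two extremes can be checked numerically. The sketch below treats $p = \mathbb{P}(\mathcal{A}_{i,t})$ as a free parameter and uses made-up values for the constant $k$ and for $t$; the observation that, when $k t \ge 1$, the term is largest at $p = 1/(kt)$ with value $1/(e\,k\,t)$ is elementary calculus added here for illustration and is not a statement from the paper.

```python
import math


def term5(p: float, k: float, t: float) -> float:
    """Value of p * exp(-k * p * t), the shape of each summand in (5)."""
    return p * math.exp(-k * p * t)


k, t = 0.05, 1000.0   # hypothetical constant k_i and time step
for p in (1.0, 0.5, 1.0 / (k * t), 0.001):
    print(f"p = {p:<6g} term = {term5(p, k, t):.3e}")

# Setting the derivative exp(-k p t) * (1 - k p t) to zero gives p = 1/(k t),
# where the term equals 1/(e k t); this is the worst case over p in [0, 1].
worst = 1.0 / (math.e * k * t)
print(f"worst case over p: {worst:.3e}")
```

Both a root-like node ($p$ near 1) and a deep leaf ($p$ near 0) sit far below the worst case, which is attained only at intermediate values of $p$ and itself shrinks like $1/t$.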



Remark 3. The bound of Theorem 1 becomes vacuous when $\lambda_i = 0$ for some $i$. 
However, note that whenever the smallest eigenvalue of the original process (i.e., 
the process at the roots) is positive, then $\lambda_i > 0$ for all nodes $i$, up to pathological 
collusions between $D$ and the $u_i$'s. As an example of such collusions, note that 
the process at node $i$ is a filtered version of the original process, as each ancestor 
$j$ of $i$ filters out $X_t$ with probability depending on the angle between $X_t$ and 
$u_j$. Hence, to make the process at node $i$ have a correlation matrix with rank 
strictly smaller than the one at $j = \mathrm{PAR}(i)$, the parameter $u_j$ should be perfectly 
aligned with an eigenvector of the process at node $j$. 

Remark 4. We are measuring regret against a reference predictor that is not 
Bayes optimal for the data model at hand. Indeed, the Bayes optimal predictor 
would use the maximum likelihood multilabel assignment given $G$ and $u_1, \dots, u_c$ 
(this assignment is easily computable using a special case of the sum-product 
algorithm [10]). Finding a good algorithm to approximate the maximum-likelihood 
assignment has proven to be a difficult task. 
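Proof (of Theorem 1). We first observe that 

$$
\begin{aligned}
\{\exists i : \hat y_{i,t} \ne V_{i,t}\}
&\le \{\exists i : y_{i,t} \ne V_{i,t}\} + \{\exists i : \hat y_{i,t} \ne y_{i,t}\} \\
&= \{\exists i : y_{i,t} \ne V_{i,t}\} + \sum_{i=1}^{c} \{\hat y_{i,t} \ne y_{i,t},\ \hat y_{j,t} = y_{j,t},\ j = 1, \dots, i-1\} .
\end{aligned}
\tag{6}
$$

Without loss of generality we can assume that the nodes in the taxonomy are 
assigned numbers such that if node $i$ is a child of node $j$ then $i > j$. The regret (6) 
can then be upper bounded as 

$$
\begin{aligned}
&\sum_{i=1}^{c} \{\hat y_{i,t} \ne y_{i,t},\ \hat y_{j,t} = y_{j,t},\ j = 1, \dots, i-1\} \\
&\quad \le \sum_{i=1}^{c} \{\hat y_{i,t} \ne y_{i,t},\ \forall j \in \mathrm{ANC}(i) : \hat y_{j,t} = y_{j,t}\} \\
&\quad = \sum_{i=1}^{c} \{\hat y_{i,t} \ne y_{i,t},\ \forall j \in \mathrm{ANC}(i) : \hat y_{j,t} = y_{j,t} = 1\} \\
&\qquad \text{(since $\hat y_{j,t} = y_{j,t} = -1$ for some ancestor $j$ implies $\hat y_{i,t} = y_{i,t} = -1$)} \\
&\quad \le \sum_{i=1}^{c} \{\hat y_{i,t} \ne y_{i,t},\ \forall j \in \mathrm{ANC}(i) : y_{j,t} = 1\} .
\end{aligned}
$$

Taking expectations we get 

$$
\mathbb{P}(\exists i : \hat y_{i,t} \ne V_{i,t}) - \mathbb{P}(\exists i : y_{i,t} \ne V_{i,t})
\le \sum_{i=1}^{c} \mathbb{P}\left(\hat y_{i,t} \ne y_{i,t},\ \forall j \in \mathrm{ANC}(i) : y_{j,t} = 1\right) .
$$

We now bound from above the simpler probability terms in the right-hand side. 
For notational brevity, in the rest of this proof we will be using $\Delta_{i,t}$ to denote 
the margin variable $u_i^\top X_t$ and $\hat\Delta_{i,t}$ to denote the algorithm's margin $W_{i,t}^\top X_t$. 
As we said earlier, our argument centers on proving that for any node $i$, $\hat\Delta_{i,t}$ 
is an asymptotically unbiased estimator of $\Delta_{i,t}$, and then on using known large 
deviation techniques to obtain the stated bound. For this purpose, we need to 
study both the conditional bias and the conditional variance of $\hat\Delta_{i,t}$. 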

Recall Figure 2. We first observe that the multilabel vectors $V_1, \dots, V_{t-1}$ 
are conditionally independent given the instance vectors $X_1, \dots, X_{t-1}$. More 
precisely, we have 

$$
\mathbb{P}(V_1, \dots, V_{t-1} \mid X_1, \dots, X_{t-1}) = \mathbb{P}(V_1 \mid X_1) \times \dots \times \mathbb{P}(V_{t-1} \mid X_{t-1}) .
$$

Also, for any given node $i$ with parent $j$, the child's labels $V_{i,i_1}, \dots, V_{i,i_{N(i,t-1)}}$ 
are independent when conditioned on both $X_1, \dots, X_{t-1}$ and the parent's labels 
$V_{j,1}, \dots, V_{j,t-1}$. Let us denote by $\mathbb{E}_{i,t}$ the conditional expectation 

$$
\mathbb{E}\left[\,\cdot \mid (X_1, V_{j,1}), \dots, (X_{t-1}, V_{j,t-1}), X_t \right] .
$$

By definition of our parametric model (1) we have $\mathbb{E}_{i,t}[V_{i,t-1}] = S_{i,t-1}^\top u_i$. 
Recalling the definition (3) of $W_{i,t}$, this implies 

$$
\mathbb{E}_{i,t}[\hat\Delta_{i,t}] = u_i^\top S_{i,t-1} S_{i,t-1}^\top \left(I + S_{i,t-1} S_{i,t-1}^\top + X_t X_t^\top\right)^{-1} X_t .
$$

In the rest of the proof, we use $\lambda_{i,t-1}$ to denote the smallest eigenvalue of 
the empirical correlation matrix $S_{i,t-1} S_{i,t-1}^\top$. The conditional bias is bounded in 
the following lemma (proven in the appendix). 

Lemma 2. With the notation introduced so far, we have: $\Delta_{i,t} = \mathbb{E}_{i,t}[\hat\Delta_{i,t}] + B_{i,t}$, 
where the conditional bias $B_{i,t}$ satisfies $B_{i,t} \le 2/(1 + \lambda_{i,t-1})$. 

Next, we consider the conditional variance of $\hat\Delta_{i,t}$. Recalling Figure 2, we see 
that 

$$
\hat\Delta_{i,t} = \sum_{k=1}^{N(i,t-1)} V_{i,i_k} Z_k
$$

where $Z = (Z_1, \dots, Z_{N(i,t-1)})^\top = S_{i,t-1}^\top \left(I + S_{i,t-1} S_{i,t-1}^\top + X_t X_t^\top\right)^{-1} X_t$. 
The next lemma (proven in the appendix) handles the conditional variance $\|Z\|^2$. 

Lemma 3. With the notation introduced so far, we have: $\|Z\|^2 \le 1/(2 + \lambda_{i,t-1})$. 

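Armed with these two lemmas, we proceed through our large deviation argument. 
For the sake of brevity, denote $N(i,t-1)$ by $N$. Also, in order to stress the 
dependence of $\lambda_{i,t-1}$, $\hat\Delta_{i,t}$ and $B_{i,t}$ on $N(i,t-1)$, we denote them by $\lambda_{i,N}$, 
$\hat\Delta_{i,t,N}$ and $B_{i,N}$, respectively. The case when subscript $N$ is replaced by its 
realization $n$ should be intended as the random variable obtained by restricting 
to sample realizations such that $N$ takes on value $n$. Thus, for instance, any 
predicate $\phi(\hat\Delta_{i,t,n})$ involving $\hat\Delta_{i,t,n}$ should actually be intended as a short-hand 
for $\phi(\hat\Delta_{i,t,N}) \wedge N = n$. 

Recall that $\mathcal{A}_{i,t} = \{\forall j \in \mathrm{ANC}(i) : u_j^\top X_t > 0\} = \{\forall j \in \mathrm{ANC}(i) : y_{j,t} = 1\}$. 
We have 

$$
\begin{aligned}
\{\hat y_{i,t} \ne y_{i,t},\ \mathcal{A}_{i,t}\}
&\le \{\hat\Delta_{i,t,N}\, \Delta_{i,t} \le 0,\ \mathcal{A}_{i,t}\} \\
&\le \{|\hat\Delta_{i,t,N} - \Delta_{i,t}| \ge |\Delta_{i,t}|,\ \mathcal{A}_{i,t}\} \\
&\le \{|\hat\Delta_{i,t,N} + B_{i,N} - \Delta_{i,t}| \ge |\Delta_{i,t}| - |B_{i,N}|,\ \mathcal{A}_{i,t}\} \\
&\le \{|\hat\Delta_{i,t,N} + B_{i,N} - \Delta_{i,t}| \ge |\Delta_{i,t}|/2,\ \mathcal{A}_{i,t}\} + \{|B_{i,N}| \ge |\Delta_{i,t}|/2,\ \mathcal{A}_{i,t}\} .
\end{aligned}
\tag{7}
$$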
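We can bound the two terms of (7) separately. Let $M < t$ be an integer 
constant to be specified later. For the first term we obtain 

$$
\begin{aligned}
&\{|\hat\Delta_{i,t,N} + B_{i,N} - \Delta_{i,t}| \ge |\Delta_{i,t}|/2,\ \mathcal{A}_{i,t}\} \\
&\quad \le \{|\hat\Delta_{i,t,N} + B_{i,N} - \Delta_{i,t}| \ge |\Delta_{i,t}|/2,\ \mathcal{A}_{i,t},\ N \ge M,\ \lambda_{i,N} \ge \lambda_i N/2\} \\
&\qquad + \{\mathcal{A}_{i,t},\ N \ge M,\ \lambda_{i,N} < \lambda_i N/2\} + \{\mathcal{A}_{i,t},\ N < M\} \\
&\quad \le \sum_{n=M}^{t-1} \{|\hat\Delta_{i,t,n} + B_{i,n} - \Delta_{i,t}| \ge |\Delta_{i,t}|/2,\ \mathcal{A}_{i,t},\ \lambda_{i,n} \ge \lambda_i n/2\} \\
&\qquad + \sum_{n=M}^{t-1} \{\mathcal{A}_{i,t},\ \lambda_{i,n} < \lambda_i n/2\} + \{\mathcal{A}_{i,t},\ N < M\} .
\end{aligned}
$$

For the second term, using Lemma 2 we get 

$$
\begin{aligned}
\{|B_{i,N}| \ge |\Delta_{i,t}|/2,\ \mathcal{A}_{i,t}\}
&\le \left\{ \frac{2}{1 + \lambda_{i,N}} \ge |\Delta_{i,t}|/2,\ \mathcal{A}_{i,t} \right\} \\
&\le \left\{ \frac{2}{1 + \lambda_{i,N}} \ge |\Delta_{i,t}|/2,\ \mathcal{A}_{i,t},\ N \ge M,\ \lambda_{i,N} \ge \lambda_i N/2 \right\} \\
&\quad + \{\mathcal{A}_{i,t},\ N \ge M,\ \lambda_{i,N} < \lambda_i N/2\} + \{\mathcal{A}_{i,t},\ N < M\} .
\end{aligned}
$$

Now note that the choice $M \ge 8/(\lambda_i \gamma_i) \ge 8/(\lambda_i |\Delta_{i,t}|)$ makes the first term 
vanish. Hence, under this condition on $M$, 

$$
\begin{aligned}
\{|B_{i,N}| \ge |\Delta_{i,t}|/2,\ \mathcal{A}_{i,t}\}
&\le \{\mathcal{A}_{i,t},\ N \ge M,\ \lambda_{i,N} < \lambda_i N/2\} + \{\mathcal{A}_{i,t},\ N < M\} \\
&\le \sum_{n=M}^{t-1} \{\mathcal{A}_{i,t},\ \lambda_{i,n} < \lambda_i n/2\} + \{\mathcal{A}_{i,t},\ N < M\} .
\end{aligned}
$$

Plugging back into (7) and introducing probabilities yields 

$$
\begin{aligned}
\mathbb{P}(\hat y_{i,t} \ne y_{i,t},\ \mathcal{A}_{i,t})
&\le \sum_{n=M}^{t-1} \mathbb{P}\left(|\hat\Delta_{i,t,n} + B_{i,n} - \Delta_{i,t}| \ge |\Delta_{i,t}|/2,\ \mathcal{A}_{i,t},\ \lambda_{i,n} \ge \lambda_i n/2\right) && (8) \\
&\quad + 2 \sum_{n=M}^{t-1} \mathbb{P}\left(\mathcal{A}_{i,t},\ \lambda_{i,n} < \lambda_i n/2\right) && (9) \\
&\quad + 2\, \mathbb{P}(\mathcal{A}_{i,t},\ N < M) . && (10)
\end{aligned}
$$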

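Let $j = \mathrm{PAR}(i)$ and $\mathbb{P}_{i,t}$ denote $\mathbb{P}(\,\cdot \mid (X_1, V_{j,1}), \dots, (X_{t-1}, V_{j,t-1}), X_t)$. Notice 
that $V_{i,i_1}, \dots, V_{i,i_{N(i,t-1)}}$ are independent w.r.t. $\mathbb{P}_{i,t}$. We bound (8) by combining 
Chernoff-Hoeffding inequalities [8] with Lemma 3: 

$$
\begin{aligned}
&\mathbb{P}_{i,t}\left(|\hat\Delta_{i,t,n} + B_{i,n} - \Delta_{i,t}| \ge |\Delta_{i,t}|/2,\ \mathcal{A}_{i,t},\ \lambda_{i,n} \ge \lambda_i n/2\right) \\
&\quad = \{\mathcal{A}_{i,t}\} \times \{\lambda_{i,n} \ge \lambda_i n/2\} \times \mathbb{P}_{i,t}\left(|\hat\Delta_{i,t,n} + B_{i,n} - \Delta_{i,t}| \ge |\Delta_{i,t}|/2\right) \\
&\quad \le \{\mathcal{A}_{i,t}\} \times \{\lambda_{i,n} \ge \lambda_i n/2\} \times 2\, e^{-\Delta_{i,t}^2 (2 + \lambda_{i,n})/8} \\
&\quad \le 2\, \{\mathcal{A}_{i,t}\}\, e^{-\gamma_i^2 \lambda_i n/16} .
\end{aligned}
$$

Thus, integrating out the conditioning, we get that (8) is upper bounded by 

$$
2\, \mathbb{P}(\mathcal{A}_{i,t}) \sum_{n=M}^{t-1} e^{-\gamma_i^2 \lambda_i n/16} .
$$

Since the process at each node $i$ is i.i.d., we can bound (9) through the 
concentration result contained in Lemma 1. Choosing $M \ge 96 d/\lambda_i^2$, we get 

$$
\mathbb{P}_{i,t}\left(\mathcal{A}_{i,t},\ \lambda_{i,n} < \lambda_i n/2\right) = \{\mathcal{A}_{i,t}\}\, \mathbb{P}_{i,t}\left(\lambda_{i,n} < \lambda_i n/2\right)
\le 2(n+1)\, \{\mathcal{A}_{i,t}\}\, e^{-n\lambda_i^2/304} .
$$

Thus, integrating out the conditioning again, we get that (9) is upper bounded 
by 

$$
2\, \mathbb{P}(\mathcal{A}_{i,t}) \sum_{n=M}^{t-1} (n+1)\, e^{-n\lambda_i^2/304} \le \mathbb{P}(\mathcal{A}_{i,t})\, (t+1)^2\, e^{-M\lambda_i^2/304} .
$$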

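Finally, we analyze (10) as follows. Recall that $N = N(i,t-1)$ counts the number 
of times node $j$, the parent of node $i$, has observed $V_{j,s} = 1$ for $s = 1, \dots, t-1$. 
Therefore $\mathbb{P}(\mathcal{A}_{i,t},\ N < M) = \mathbb{P}(\mathcal{A}_{i,t})\, \mathbb{P}(N < M)$, and we can focus on the latter 
probability. The random variable $N$ is binomial and we can bound its parameter 
$\mu_i$ as follows. Let $j(1) \to j(2) \to \dots \to j(h_i) \to i$ be the unique path from a root 
down to node $i$ (that is, $\mathrm{ANC}(i) = \{j(1), \dots, j(h_i)\}$ and $j(h_i) = \mathrm{PAR}(i)$). Fix 
any $x \in \mathbb{R}^d$ such that $\|x\| = 1$. Exploiting the way conditional probabilities 
are defined in our taxonomy (see Section 2), for a generic time step $s \le t-1$ we 
can write 

$$
\begin{aligned}
\mathbb{P}\left(V_{\mathrm{PAR}(i),s} = 1 \mid x\right)
&= \prod_{k=1}^{h_i} \mathbb{P}\left(V_{j(k),s} = 1 \mid V_{j(k-1),s} = 1,\ x\right) \\
&= \prod_{k=1}^{h_i} \frac{1 + u_{j(k)}^\top x}{2} \qquad \text{(using (1))} \\
&\ge \{\mathcal{A}_{i,t}\} \left(\tfrac{1}{2}\right)^{h_i}
\end{aligned}
$$

since $\mathcal{A}_{i,t}$ is equivalent to $u_{j(k)}^\top x > 0$ for $k = 1, \dots, h_i$. Integrating over $x$ 
we conclude that the parameter $\mu_i$ of the binomial random variable $N$ satisfies 

$$
\mu_i = \mathbb{P}\left(V_{\mathrm{PAR}(i),s} = 1\right) \ge \left(\tfrac{1}{2}\right)^{h_i} \mathbb{P}(\mathcal{A}_{i,t}) .
$$

We now set $M$ as follows: 

$$
M = \left\lfloor \frac{(t-1)\, \mathbb{P}(\mathcal{A}_{i,t})}{2^{h_i+1}} \right\rfloor .
$$

This implies 

$$
\begin{aligned}
\mathbb{P}(\mathcal{A}_{i,t},\ N < M) &= \mathbb{P}(\mathcal{A}_{i,t})\, \mathbb{P}(N < M) \\
&\le \mathbb{P}(\mathcal{A}_{i,t})\, \mathbb{P}\left( N \le \frac{(t-1)\, \mathbb{P}(\mathcal{A}_{i,t})}{2^{h_i+1}} \right) \\
&\le \mathbb{P}(\mathcal{A}_{i,t}) \exp\left( -\frac{(t-1)\, \mathbb{P}(\mathcal{A}_{i,t})}{10 \cdot 2^{h_i}} \right)
\end{aligned}
\tag{11}
$$

where we used Bernstein's inequality (see, e.g., [4, Ch. 8]) and our choice of $M$ 
to prove (11). 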

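Piecing together, overapproximating, and using in the bounds for (8) and (9) 
the conditions on $t$, along with $M \ge (t-1)\, \mathbb{P}(\mathcal{A}_{i,t})/2^{h_i+1} - 1$ results in 

$$
\begin{aligned}
&\mathbb{P}(\exists i : \hat y_{i,t} \ne V_{i,t}) - \mathbb{P}(\exists i : y_{i,t} \ne V_{i,t}) \\
&\quad \le \sum_{i=1}^{c} \mathbb{P}(\hat y_{i,t} \ne y_{i,t},\ \mathcal{A}_{i,t}) \\
&\quad \le \sum_{i=1}^{c} \mathbb{P}(\mathcal{A}_{i,t}) \left[ 2\,e\,t \exp\left( -\frac{\gamma_i^2 \lambda_i (t-1)\, \mathbb{P}(\mathcal{A}_{i,t})}{16 \cdot 2^{h_i+1}} \right)
+ e\,(t+1)^2 \exp\left( -\frac{\lambda_i^2 (t-1)\, \mathbb{P}(\mathcal{A}_{i,t})}{304 \cdot 2^{h_i+1}} \right)
+ \exp\left( -\frac{(t-1)\, \mathbb{P}(\mathcal{A}_{i,t})}{5 \cdot 2^{h_i+1}} \right) \right] ,
\end{aligned}
$$

thereby concluding the proof. □ 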



5 Preliminary Experimental Results 

To support our theoretical results, we are testing some variants of our hierarchical 
classification algorithm on real-world textual data. In a preliminary series of 
experiments, we used the first 40,000 newswire stories from the Reuters Corpus 
Volume 1 (RCV1). The newswire stories in RCV1 are classified in a taxonomy 
of 102 nodes divided into 4 trees, where multiple-path and partial-path 
classifications repeatedly occur throughout the corpus. We trained our algorithm on 
the first 20,000 consecutive documents and tested it on the subsequent 20,000 
documents (to represent documents as real vectors, we used the standard tf-idf 
bag-of-words encoding; more details will be given in the full paper). To make 
the algorithm of Figure 2 more space-efficient, we stored in the estimator 
associated with each node only the examples that achieved a small margin or those 
that were incorrectly classified by the current estimator. In [3] this technique is 
shown to be quite effective in terms of the number of instances stored and not 
disruptive in terms of classification performance. This space-efficient version of 
our algorithm achieved a test error of 46.6% (recall that an instance is considered 
mistaken if at least one out of 102 labels is guessed wrong). For comparison, if we 
replace our estimator with the standard Perceptron algorithm [16,14] (without 
touching the rest of the algorithm) the test error goes up to 65.8%, and this 
performance does not change significantly if we train the Perceptron algorithm 
at each node with all the examples independently (rather than using only the 
examples that are positive for the parent). For the space-efficient variant of our 
algorithm, we observed that training each node independently causes a moderate 
increase of the test error from 46.6% to 49.6%. Besides, hierarchical training is 
in general much faster than independent training. 



6 Conclusions and Ongoing Research 

We have introduced a new hierarchical classification algorithm working with 
linear-threshold functions. The algorithm has complete knowledge of the tax- 
onomy and maintains at each node a regularized least-squares estimator of the 
true (unknown) margin associated to the process at that node. The predictions 
at the nodes are combined in a top-down fashion. We analyzed this algorithm in 
the i.i.d. setting by providing a bound on the instantaneous regret, i.e., on the 
amount by which the probability of misclassification by the algorithm exceeds 
on a randomly drawn instance the probability of misclassification by the hierar- 
chical algorithm knowing all model parameters. We also reported on preliminary 
experiments with a few variants of our basic algorithm. 

Our analysis in Section 4 works under side assumptions about the distribu- 
tion D generating the examples. We are currently investigating the extent to 
which it is possible to remove some of these assumptions with no further techni- 
cal complications. A major theoretical open question is the comparison between 
our algorithm (or variants thereof) and the Bayes optimal predictor for our para- 
metric model. Finally, we are planning to perform a more extensive experimental 
study on a variety of hierarchical datasets. 

References 

1. K.S. Azoury and M.K. Warmuth. Relative loss bounds for on-line density 
estimation with the exponential family of distributions. Machine Learning, 
43(3):211-246, 2001. 

2. N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order Perceptron 
algorithm. In Proc. 15th COLT, pages 121-137. LNAI 2375, Springer, 2002. 

3. N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic 
linear-threshold classifiers via selective sampling. In Proc. 16th COLT, pages 373-386. 
LNAI 2777, Springer, 2003. 

4. L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern 
Recognition. Springer Verlag, 1996. 







N. Cesa-Bianchi, A. Conconi, and C. Gentile 



5. S.T. Dumais and H. Chen. Hierarchical classification of web content. In 
Proceedings of the 23rd ACM International Conference on Research and Development in 
Information Retrieval, pages 256-263. ACM Press, 2000. 

6. M. Granitzer. Hierarchical Text Classification using Methods from Machine 
Learning. PhD thesis, Graz University of Technology, 2003. 

7. T. Hofmann, L. Cai, and M. Ciaramita. Learning with taxonomies: classifying 
documents and words. NIPS 2003: Workshop on syntax, semantics, and statistics, 
2003. 

8. W. Hoeffding. Probability inequalities for sums of bounded random variables. 
Journal of the American Statistical Association, 58:13-30, 1963. 

9. R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1985. 

10. F.R. Kschischang, B.J. Frey, and H. Loeliger. Factor graphs and the sum-product 
algorithm. IEEE Trans. of Information Theory, 47(2):498-519, 2001. 

11. D. Koller and M. Sahami. Hierarchically classifying documents using very few 
words. In Proc. 14th ICML, pages 170-178. Morgan Kaufmann Publishers, 1997. 

12. A.K. McCallum, R. Rosenfeld, T.M. Mitchell, and A.Y. Ng. Improving text 
classification by shrinkage in a hierarchy of classes. In Proc. 15th ICML, pages 359-367. 
Morgan Kaufmann Publishers, 1998. 

13. D. Mladenic. Turning Yahoo into an automatic web-page classifier. In Proc. 13th 
European Conference on Artificial Intelligence, pages 473-474, 1998. 

14. A.B.J. Novikoff. On convergence proofs on perceptrons. In Proc. of the Symposium 
on the Mathematical Theory of Automata, volume XII, pages 615-622, 1962. 

15. R. Rifkin, G. Yeo, and T. Poggio. Regularized least squares classification. In 
Advances in Learning Theory: Methods, Model and Applications. NATO Science 
Series III: Computer and Systems Sciences, volume 190, pages 131-153. IOS Press, 
2003. 

16. F. Rosenblatt. The perceptron: A probabilistic model for information storage and 
organization in the brain. Psychological Review, 65:386-408, 1958. 

17. M.E. Ruiz and P. Srinivasan. Hierarchical text categorization using neural 
networks. Information Retrieval, 5(1):87-118, 2002. 

18. J. Shawe-Taylor, C. Williams, N. Cristianini, and J.S. Kandola. On the 
eigenspectrum of the Gram matrix and its relationship to the operator eigenspectrum. In 
Proc. 13th ALT, pages 23-40. LNCS 2533, Springer, 2002. 

19. A. Sun and E.-P. Lim. Hierarchical text classification and evaluation. In Proc. 
2001 International Conference on Data Mining, pages 521-528. IEEE Press, 2001. 

20. V. Vovk. Competitive on-line statistics. International Statistical Review, 
69:213-248, 2001. 



Appendix 



This appendix contains the proofs of Lemma 2 and Lemma 3 mentioned in the 
main text. Recall that, given a positive definite matrix $A$, the spectral norm of 
$A$, denoted by $\|A\|$, equals the largest eigenvalue of $A$. As a simple consequence, 
$\|A^{-1}\|$ is the reciprocal of the smallest eigenvalue of $A$. 
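Proof of Lemma 2 

Setting $A = I + S_{i,t-1} S_{i,t-1}^\top$ we get 

$$
\begin{aligned}
\Delta_{i,t} &= \mathbb{E}_{i,t}[\hat\Delta_{i,t}] + u_i^\top \left(I + X_t X_t^\top\right) \left(A + X_t X_t^\top\right)^{-1} X_t \\
&= \mathbb{E}_{i,t}[\hat\Delta_{i,t}] + u_i^\top \left(A + X_t X_t^\top\right)^{-1} X_t + \Delta_{i,t}\, X_t^\top \left(A + X_t X_t^\top\right)^{-1} X_t .
\end{aligned}
\tag{12}
$$

Using the Sherman-Morrison formula (e.g., [9, Ch. 1]) and the symmetry of $A$, 
we can rewrite the second term of (12) as 

$$
\begin{aligned}
u_i^\top \left(A + X_t X_t^\top\right)^{-1} X_t
&= u_i^\top \left( A^{-1} - \frac{A^{-1} X_t X_t^\top A^{-1}}{1 + X_t^\top A^{-1} X_t} \right) X_t \\
&= u_i^\top A^{-1} X_t - \frac{u_i^\top A^{-1} X_t\, X_t^\top A^{-1} X_t}{1 + X_t^\top A^{-1} X_t} \\
&= \frac{u_i^\top A^{-1} X_t}{1 + X_t^\top A^{-1} X_t}
\end{aligned}
$$

and the third term of (12) as 

$$
\Delta_{i,t}\, X_t^\top \left(A + X_t X_t^\top\right)^{-1} X_t
= \Delta_{i,t}\, X_t^\top \left( A^{-1} - \frac{A^{-1} X_t X_t^\top A^{-1}}{1 + X_t^\top A^{-1} X_t} \right) X_t
= \Delta_{i,t}\, \frac{X_t^\top A^{-1} X_t}{1 + X_t^\top A^{-1} X_t} .
$$

Plugging back into (12) yields $\Delta_{i,t} = \mathbb{E}_{i,t}[\hat\Delta_{i,t}] + B_{i,t}$ where the conditional bias 
$B_{i,t}$ satisfies 

$$
\begin{aligned}
B_{i,t} &= \frac{u_i^\top A^{-1} X_t}{1 + X_t^\top A^{-1} X_t} + \Delta_{i,t}\, \frac{X_t^\top A^{-1} X_t}{1 + X_t^\top A^{-1} X_t} \\
&\le \frac{\|u_i\|\, \|A^{-1}\|\, \|X_t\|}{1 + X_t^\top A^{-1} X_t} + \frac{|\Delta_{i,t}|\, \|X_t\|^2\, \|A^{-1}\|}{1 + X_t^\top A^{-1} X_t} \\
&\le \frac{\|A^{-1}\|}{1 + X_t^\top A^{-1} X_t} + \frac{\|A^{-1}\|}{1 + X_t^\top A^{-1} X_t} \\
&\le 2\, \|A^{-1}\| .
\end{aligned}
$$

Here the second inequality holds because $\|u_i\| = \|X_t\| = 1$ and $|\Delta_{i,t}| \le 
\|u_i\|\, \|X_t\| = 1$, and the third inequality holds because $X_t^\top A^{-1} X_t \ge 0$ by 
the positive definiteness of $A^{-1}$. Recalling that $\|A^{-1}\| = 1/(1 + \lambda_{i,t-1})$, where 
$1 + \lambda_{i,t-1}$ is the smallest eigenvalue of $A$, concludes the proof. □ 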
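The two Sherman-Morrison simplifications used in this proof, and the resulting bound on the bias, can be checked numerically on a small instance. The sketch below works in dimension 2 with made-up values for $A$, $X_t$ and $u_i$ (all hypothetical); it verifies the identities for these numbers only and is not a proof.

```python
import math
from typing import List

Matrix = List[List[float]]


def inv2(m: Matrix) -> Matrix:
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def quad(u: List[float], m: Matrix, x: List[float]) -> float:
    """Bilinear form u^T m x."""
    return sum(u[r] * m[r][c] * x[c] for r in range(2) for c in range(2))


# Hypothetical values: A = I + S S^T for some stored examples, unit vectors x and u.
A = [[2.5, 0.5], [0.5, 1.5]]
x = [0.6, 0.8]
u = [1.0, 0.0]
delta = u[0] * x[0] + u[1] * x[1]                 # true margin u^T x

A_plus = [[A[r][c] + x[r] * x[c] for c in range(2)] for r in range(2)]
r_val = quad(x, inv2(A), x)                       # r = x^T A^{-1} x

lhs_second = quad(u, inv2(A_plus), x)             # u^T (A + x x^T)^{-1} x
rhs_second = quad(u, inv2(A), x) / (1.0 + r_val)

lhs_third = quad(x, inv2(A_plus), x)              # x^T (A + x x^T)^{-1} x
rhs_third = r_val / (1.0 + r_val)

# Smallest eigenvalue of the symmetric 2x2 matrix A, so that ||A^{-1}|| = 1 / min_eig.
tr = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
min_eig = (tr - math.sqrt(tr * tr - 4.0 * det)) / 2.0

bias = rhs_second + delta * rhs_third
print(lhs_second - rhs_second, lhs_third - rhs_third, bias, 2.0 / min_eig)
```

The first two printed differences are zero up to rounding, and the bias stays below $2\|A^{-1}\|$ as the lemma states.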



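Proof of Lemma 3 

Setting for brevity $H = S_{i,t-1}^\top A^{-1} X_t$ and $r = X_t^\top A^{-1} X_t$ we can write 

$$
\begin{aligned}
\|Z\|^2 &= X_t^\top \left(A + X_t X_t^\top\right)^{-1} S_{i,t-1} S_{i,t-1}^\top \left(A + X_t X_t^\top\right)^{-1} X_t \\
&= X_t^\top \left( A^{-1} - \frac{A^{-1} X_t X_t^\top A^{-1}}{1 + X_t^\top A^{-1} X_t} \right) S_{i,t-1} S_{i,t-1}^\top \left( A^{-1} - \frac{A^{-1} X_t X_t^\top A^{-1}}{1 + X_t^\top A^{-1} X_t} \right) X_t \\
&\qquad \text{(by the Sherman-Morrison formula)} \\
&= H^\top H - \frac{r}{1+r} H^\top H - \frac{r}{1+r} H^\top H + \frac{r^2}{(1+r)^2} H^\top H \\
&= \frac{H^\top H}{(1+r)^2} = \frac{X_t^\top A^{-1} S_{i,t-1} S_{i,t-1}^\top A^{-1} X_t}{\left(1 + X_t^\top A^{-1} X_t\right)^2} \\
&\le \left\| S_{i,t-1} S_{i,t-1}^\top A^{-1} \right\| \frac{X_t^\top A^{-1} X_t}{\left(1 + X_t^\top A^{-1} X_t\right)^2}
= \left\| S_{i,t-1} S_{i,t-1}^\top A^{-1} \right\| \frac{r}{(1+r)^2} .
\end{aligned}
\tag{13}
$$

We continue by bounding the two factors in (13). Observe that 

$$
r = X_t^\top A^{-1} X_t \le \|A^{-1}\| = \frac{1}{1 + \lambda_{i,t-1}} \le 1
$$

and that the function $f(x) = x/(1+x)^2$ is monotonically increasing when $x \in 
[0, 1]$. Hence 

$$
\frac{r}{(1+r)^2} = f(r) \le f\left( \frac{1}{1 + \lambda_{i,t-1}} \right) = \frac{1 + \lambda_{i,t-1}}{(2 + \lambda_{i,t-1})^2} \le \frac{1}{2 + \lambda_{i,t-1}} .
$$

As far as the second factor is concerned, we just note that the two matrices 
$A^{-1}$ and $S_{i,t-1} S_{i,t-1}^\top$ have the same eigenvectors. Therefore 

$$
\left\| S_{i,t-1} S_{i,t-1}^\top A^{-1} \right\| = \frac{\lambda}{1 + \lambda} \le 1
$$

where $\lambda$ is some eigenvalue of $S_{i,t-1} S_{i,t-1}^\top$. Substituting into (13) yields 

$$
\|Z\|^2 \le \frac{1}{2 + \lambda_{i,t-1}}
$$

as desired. □ 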
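As a sanity check on Lemma 3, the quantity $\|Z\|^2$ can be computed directly for a small example and compared with $1/(2 + \lambda_{i,t-1})$. The stored instances and the current instance below are made-up unit vectors in dimension 2, chosen only so the arithmetic is easy to follow; the sketch checks one instance of the inequality and nothing more.

```python
import math
from typing import List

Matrix = List[List[float]]


def inv2(m: Matrix) -> Matrix:
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


cols = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]   # hypothetical columns of S (unit norm)
x = [0.8, -0.6]                               # hypothetical current instance (unit norm)

# gram = S S^T, and the matrix inverted in (3) is I + S S^T + x x^T.
gram = [[sum(c[r] * c[k] for c in cols) for k in range(2)] for r in range(2)]
a_plus = [[float(r == k) + gram[r][k] + x[r] * x[k] for k in range(2)] for r in range(2)]

inv = inv2(a_plus)
y = [inv[r][0] * x[0] + inv[r][1] * x[1] for r in range(2)]   # (A + x x^T)^{-1} x
z = [c[0] * y[0] + c[1] * y[1] for c in cols]                 # Z = S^T (A + x x^T)^{-1} x
z_norm_sq = sum(v * v for v in z)

# Smallest eigenvalue of the symmetric 2x2 matrix S S^T.
tr = gram[0][0] + gram[1][1]
det = gram[0][0] * gram[1][1] - gram[0][1] * gram[1][0]
lam_min = (tr - math.sqrt(tr * tr - 4.0 * det)) / 2.0

bound = 1.0 / (2.0 + lam_min)
print(z_norm_sq, bound)
```

For these values $S S^\top$ has eigenvalues 1 and 2, so the bound is $1/3$, while $\|Z\|^2$ is roughly $0.11$.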
Abstract. We give an algorithm for the bandit version of a very general 
online optimization problem considered by Kalai and Vempala [1], for the 
case of an adaptive adversary. In this problem we are given a bounded 
set $S \subseteq \mathbb{R}^n$ of feasible points. At each time step $t$, the online algorithm 
must select a point $\mathbf{x}^t \in S$ while simultaneously an adversary selects a 
cost vector $\mathbf{c}^t \in \mathbb{R}^n$. The algorithm then incurs cost $\mathbf{c}^t \cdot \mathbf{x}^t$. Kalai and 
Vempala show that even if $S$ is exponentially large (or infinite), so long as 
we have an efficient algorithm for the offline problem (given $\mathbf{c} \in \mathbb{R}^n$, find 
$\mathbf{x} \in S$ to minimize $\mathbf{c} \cdot \mathbf{x}$) and so long as the cost vectors are bounded, one 
can efficiently solve the online problem of performing nearly as well as 
the best fixed $\mathbf{x} \in S$ in hindsight. The Kalai-Vempala algorithm assumes 
that the cost vectors $\mathbf{c}^t$ are given to the algorithm after each time step. 
In the "bandit" version of the problem, the algorithm only observes its 
cost, $\mathbf{c}^t \cdot \mathbf{x}^t$. Awerbuch and Kleinberg [2] give an algorithm for the bandit 
version for the case of an oblivious adversary, and an algorithm that 
works against an adaptive adversary for the special case of the shortest 
path problem. They leave open the problem of handling an adaptive 
adversary in the general case. In this paper, we solve this open problem, 
giving a simple online algorithm for the bandit problem in the general 
case in the presence of an adaptive adversary. Ignoring a (polynomial) 
dependence on $n$, we achieve a regret bound of $O(T^{3/4}\sqrt{\ln T})$. 



1 Introduction 

Kalai and Vempala [1] give an elegant, efficient algorithm for a broad class of 
online optimization problems. In their setting, we have an arbitrary (bounded) 
set $S \subseteq \mathbb{R}^n$ of feasible points. At each time step $t$, an online algorithm $A$ must 
select a point $\mathbf{x}^t \in S$ and simultaneously an adversary selects a cost vector 
$\mathbf{c}^t \in \mathbb{R}^n$ (throughout the paper we use superscripts to index iterations). The 
algorithm then observes $\mathbf{c}^t$ and incurs cost $\mathbf{c}^t \cdot \mathbf{x}^t$. Kalai and Vempala show that 
so long as we have an efficient algorithm for the offline problem (given $\mathbf{c} \in \mathbb{R}^n$ 
find $\mathbf{x} \in S$ to minimize $\mathbf{c} \cdot \mathbf{x}$) and so long as the cost vectors are bounded, we can 
efficiently solve the online problem of performing nearly as well as the best fixed 
$\mathbf{x} \in S$ in hindsight. This generalizes the classic "expert advice" problem, because 
we do not require the set $S$ to be represented explicitly: we just need an efficient 
oracle for selecting the best $\mathbf{x} \in S$ in hindsight. Further, it decouples the number 
of experts from the underlying dimensionality $n$ of the decision set, under the 
assumption that the cost of a decision is a linear function of $n$ features of the decision. 
The standard experts setting can be recovered by letting $S = \{\mathbf{e}_1, \dots, \mathbf{e}_n\}$, the 
columns of the $n \times n$ identity matrix. 

A problem that fits naturally into this framework is an online shortest path 
problem where we repeatedly travel between two points $a$ and $b$ in some graph 
whose edge costs change each day (say, due to traffic). In this case, we can 
view the set of paths as a set $S$ of points in a space of dimension equal to the 
number of edges in the graph, and $\mathbf{c}^t$ is simply the vector of edge costs on day $t$. 
Even though the number of paths in a graph can be exponential in the number 
of edges (i.e., the set $S$ is of exponential size), since we can solve the shortest 
path problem for any given set of edge lengths, we can apply the Kalai-Vempala 
algorithm. (Note that a different algorithm for the special case of the online 
shortest path problem is given by Takimoto and Warmuth [3].) 

A natural generalization of the above problem, considered by Awerbuch and 
Kleinberg [2], is to imagine that rather than being given the entire cost vector $\mathbf{c}^t$, 
the algorithm is simply told the cost incurred $\mathbf{c}^t \cdot \mathbf{x}^t$. For example, in the case of 
shortest paths, rather than being told the lengths of all edges at time $t$, this would 
correspond to just being told the total time taken to reach the destination. Thus, 
this is the "bandit version" of the Kalai-Vempala setting. Awerbuch and 
Kleinberg present two results: an algorithm for the general problem in the presence 
of an oblivious adversary, and an algorithm for the special case of the shortest 
path problem that works in the presence of an adaptive adversary. The difference 
between the two adversaries is that an oblivious adversary must commit to the 
entire sequence of cost vectors in advance, whereas an adaptive adversary may 
determine the next cost vector based on the online algorithm's play (and hence, 
the information the algorithm received) in the previous time steps. Thus, an 
adaptive adversary is in essence playing a repeated game. They leave open the 
question of achieving good regret guarantees for an adaptive adversary in the 
general setting. 

In this paper we solve the open question of [2], giving an algorithm for the 
general bandit setting in the presence of an adaptive adversary. Moreover, our 
method is significantly simpler than the special-purpose algorithm of Awerbuch 
and Kleinberg for shortest paths. Our bounds are somewhat worse: we achieve 
regret bounds of $O(T^{3/4}\sqrt{\ln T})$ compared to the $O(T^{2/3})$ bounds of [2]. We 
believe improvement in this direction may be possible, and present some discussion 
of this issue at the end of the paper. 

The basic idea of our approach is as follows. We begin by noticing that the 
only history information used by the Kalai-Vempala algorithm in determining 
its action at time $t$ is the sum $\mathbf{c}^{1:t-1} = \sum_{\tau=1}^{t-1} \mathbf{c}^\tau$ of all cost vectors received so far 
(we use this abbreviated notation for sums over iteration indexes throughout the 
paper). Furthermore, the way this is used in the algorithm is by adding random 
noise $\boldsymbol{\mu}$ to this vector, and then calling the offline oracle to find the $\mathbf{x}^t \in S$ 
that minimizes $(\mathbf{c}^{1:t-1} + \boldsymbol{\mu}) \cdot \mathbf{x}^t$. So, if we can design a bandit algorithm that 
produces an estimate $\hat{\mathbf{c}}^{1:t-1}$ of $\mathbf{c}^{1:t-1}$, and show that with high probability even 
an adaptive adversary will not cause $\hat{\mathbf{c}}^{1:t-1}$ to differ too substantially from 
$\mathbf{c}^{1:t-1}$, we can then argue that the distribution $\hat{\mathbf{c}}^{1:t-1} + \boldsymbol{\mu}$ is close enough to $\mathbf{c}^{1:t-1} + \boldsymbol{\mu}$ 
for the Kalai-Vempala analysis to apply. In fact, to make our analysis a bit more 
general, so that we could potentially use other algorithms as subroutines, we will 
argue a little differently. Let $\mathrm{OPT}(\mathbf{c}) = \min_{\mathbf{x} \in S} (\mathbf{c} \cdot \mathbf{x})$. We will show that with 
high probability, $\mathrm{OPT}(\hat{\mathbf{c}}^{1:T})$ is close to $\mathrm{OPT}(\mathbf{c}^{1:T})$ and $\hat{\mathbf{c}}^{1:T}$ satisfies conditions 
needed for the subroutine to achieve low regret on $\hat{\mathbf{c}}^{1:T}$. This means that our 
subroutine, which believes it has seen $\hat{\mathbf{c}}^{1:T}$, will achieve performance on $\hat{\mathbf{c}}^{1:T}$ 
close to $\mathrm{OPT}(\hat{\mathbf{c}}^{1:T})$. We then finish off by arguing that our performance on $\hat{\mathbf{c}}^{1:T}$ 
is close to its performance on $\mathbf{c}^{1:T}$. 

The behavior of the bandit algorithm will in fact be fairly simple. We begin
by choosing a basis $B$ of (at most) $n$ points in $S$ to use for sampling (we address
the issue of how $B$ is chosen when we describe our algorithm in detail). Then,
at each time step $t$, with probability $\gamma$ we explore by playing a random basis
element, and otherwise (with probability $1 - \gamma$) we exploit by playing according
to the Kalai-Vempala algorithm. For each basis element $b_j$, we use our cost
incurred while exploring with that basis element, scaled by $n/\gamma$, as an estimate
of $c^{1:t-1} \cdot b_j$. Using martingale tail inequalities, we argue that even an adaptive
adversary cannot make our estimate differ too wildly from the true value of
$c^{1:t-1} \cdot b_j$. We then show that after matrix inversion, our estimate $\hat{c}^{1:t-1}$
is close to its correct value with high probability.

2 Problem Formalization 

We can now fully formalize the problem. First, however, we establish a few
notational conventions. As mentioned previously, we use superscripts to index
iterations (or rounds) of our algorithm, and use the abbreviated summation notation
$c^{1:t}$ when summing variables over iterations. Vector quantities are indicated in
bold, and subscripts index into vectors or sets. Hats (such as $\hat{c}^t$) denote
estimates of the corresponding actual quantities. The variables and constants used
in the paper are summarized in Table (1).

As mentioned above, we consider the setting of [1] in which we have an
arbitrary (bounded) set $S \subseteq \mathbb{R}^n$ of feasible points. At each time step $t$, the
online algorithm $A$ must select a point $x^t \in S$ and simultaneously an adversary
selects a cost vector $c^t \in \mathbb{R}^n$. The algorithm then incurs cost $c^t \cdot x^t$. Unlike [1],
however, rather than being told $c^t$, the algorithm simply learns its cost $c^t \cdot x^t$.
For simplicity, we assume a fixed adaptive adversary $V$ and time horizon $T$
for the duration of this paper. Since our choice of algorithm parameters depends
on $T$, we assume$^1$ $T$ is known to the algorithm. We refer to the sequence of
decisions made by the algorithm so far as a decision history, which can be written
$h^t = [x^1, \ldots, x^t]$. Let $H^*$ be the set of all possible decision histories of length 0
through $T - 1$. Without loss of generality (e.g., see [5]), we assume our adaptive
adversary is deterministic, as specified by a function $V : H^* \to \mathbb{R}^n$, a mapping
from decision histories to cost vectors. Thus, $V(h^{t-1}) = c^t$ is the cost vector for
timestep $t$.

$^1$ One can remove this requirement by guessing $T$, and doubling the guess each time
we play longer than expected (see, for example, Theorem 6.4 from [4]).

We can view our online decision problem as a game, where on each iteration
$t$ the adversary $V$ selects a new cost vector $c^t$ based on $h^{t-1}$, and the online
algorithm $A$ selects a decision $x^t \in S$ based on its past plays and observations, and
possibly additional hidden state or randomness. Then, $A$ pays $c^t \cdot x^t$ and observes
this cost. For our analysis, we assume an $L_1$ bound on $S$, namely $\|x\|_1 \le D/2$ for
all $x \in S$, so $\|x - y\|_1 \le D$ for all $x, y \in S$. We also assume that $|c \cdot x| \le M$
for all $x \in S$ and all $c$ played by $V$. We also assume $S$ is full rank; if it is not, we
simply project to a lower-dimensional representation. Some of these assumptions
can be lifted or modified, but this set of assumptions simplifies the analysis.

For a fixed decision history $h^T$ and cost history $k^T = (c^1, \ldots, c^T)$, we
define $\mathrm{loss}(h^T, k^T) = \sum_{t=1}^{T} c^t \cdot x^t$. For a randomized algorithm $A$ and
adversary $V$, we define the random variable $\mathrm{loss}(A, V)$ to be $\mathrm{loss}(h^T, k^T)$, where
$h^T$ is drawn from the distribution over histories defined by $A$ and $V$, and
$k^T = (V(h^0), \ldots, V(h^{T-1}))$. When it is clear from context, we will omit the
dependence on $V$, writing only $\mathrm{loss}(A)$.

Our goal is to define an online algorithm with low regret. That is, we want
a guarantee that the total loss incurred will, in expectation, not be much larger
than the optimal strategy in hindsight against the cost sequence we actually
faced. To formalize this, first define an oracle $\mathcal{R} : \mathbb{R}^n \to S$ that solves the offline
optimization problem, $\mathcal{R}(c) = \arg\min_{x \in S}(c \cdot x)$. We then define
$\mathrm{OPT}(k^T) = c^{1:T} \cdot \mathcal{R}(c^{1:T})$. Similarly, $\mathrm{OPT}(A, V)$ is the random variable $\mathrm{OPT}(k^T)$ when $k^T$
is generated by playing $V$ against $A$. We again drop the dependence on $V$ and
$A$ when it is clear from context. Formally, we define expected regret as

\[
  E[\mathrm{loss}(A, V) - \mathrm{OPT}(A, V)]
    = E[\mathrm{loss}(A, V)] - E\Big[\min_{x \in S} \sum_{t=1}^{T} (c^t \cdot x)\Big]. \tag{1}
\]

Note that the $E[\mathrm{OPT}(A, V)]$ term corresponds to applying the min operator
separately to each possible cost history to find the best fixed decision with respect
to that particular cost history, and then taking the expectation with respect to
these histories. In [5], an alternative weaker definition of regret is given. We
discuss relationships between the definitions in Appendix B.
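As a small illustration of the quantity inside the expectation in Equation (1), the following sketch computes the regret of one realized history against the best fixed decision in hindsight. It assumes a finite decision set given as a list of tuples; the function names and the toy instance are hypothetical and not part of the paper.

```python
def total_loss(costs, decisions):
    """Sum of c^t . x^t over the rounds actually played."""
    return sum(
        sum(ci * xi for ci, xi in zip(c, x))
        for c, x in zip(costs, decisions)
    )


def best_fixed_loss(costs, feasible):
    """OPT for one cost history: min over x in S of c^{1:T} . x."""
    n = len(costs[0])
    summed = [sum(c[i] for c in costs) for i in range(n)]
    return min(sum(si * xi for si, xi in zip(summed, x)) for x in feasible)


def regret(costs, decisions, feasible):
    """Realized regret: loss of the plays minus the best fixed decision's loss."""
    return total_loss(costs, decisions) - best_fixed_loss(costs, feasible)


# Hypothetical toy instance: S is the two unit vectors in R^2.
S = [(1, 0), (0, 1)]
costs = [(1, 0), (0, 1), (1, 0)]
plays = [(1, 0), (0, 1), (1, 0)]
```

Expected regret in the sense of Equation (1) is the expectation of this realized quantity over the histories generated by the algorithm and the adversary.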



3 Algorithm 

We introduce an algorithm we call BGA, standing for Bandit-style Geometric
decision algorithm against an Adaptive adversary. The algorithm alternates
between playing decisions from a fixed basis to get unbiased estimates of costs,
and playing (hopefully) good decisions based on those estimates. In order to
determine the good decisions to play, it uses some online geometric optimization
algorithm for the full observation problem. We denote this algorithm by GEX
(Geometric Experts algorithm). The implementation of GEX we analyze is based
on the FPL algorithm of Kalai and Vempala [1]; we detail this implementation
and analysis in Appendix A. However, other algorithms could be used, for example
the algorithm of Zinkevich [6] when $S$ is convex. We view GEX as a function
from the sequence of previous cost vectors $(\hat{c}^1, \ldots, \hat{c}^{t-1})$ to distributions over
decisions.

  Choose parameters $\gamma$ and $\epsilon$, where $\epsilon$ is a parameter of GEX
  $t = 1$
  Fix a basis $B = \{b_1, \ldots, b_n\} \subseteq S$
  while playing do
      Let $\chi^t = 1$ with probability $\gamma$ and $\chi^t = 0$ otherwise
      if $\chi^t = 0$ then
          Select $x^t$ from the distribution GEX$(\hat{c}^1, \ldots, \hat{c}^{t-1})$
          Incur cost $z^t = c^t \cdot x^t$
          $\hat{c}^t = 0 \in \mathbb{R}^n$
      else
          Draw $j$ uniformly at random from $\{1, \ldots, n\}$
          $x^t = b_j$
          Incur cost and observe $z^t = c^t \cdot x^t$
          Define $\hat{\ell}^t$ by $\hat{\ell}^t_i = 0$ for $i \ne j$ and $\hat{\ell}^t_j = (n/\gamma) z^t$
          $\hat{c}^t = (B^\dagger)^{-1} \hat{\ell}^t$
      end if
      $\hat{c}^{1:t} = \hat{c}^{1:t-1} + \hat{c}^t$
      $t = t + 1$
  end while

Algorithm 1: BGA

Pseudocode for our algorithm is given in Algorithm (1). On each timestep,
we make decision $x^t$. With probability $(1 - \gamma)$, BGA plays a recommendation
$x^t = \hat{x}^t \in S$ from GEX. With probability $\gamma$, we ignore $\hat{x}^t$ and play a basis
decision, $x^t = b_j$, uniformly at random from a sampling basis $B = \{b_1, \ldots, b_n\}$.
The indicator variable $\chi^t$ is 1 on exploration iterations and 0 otherwise.
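The explore/exploit loop described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the decision set is assumed finite, the GEX subroutine is replaced by a plain follow-the-perturbed-leader draw (`fpl_oracle`), the adversary is any deterministic function of the decision history, and all names (`bga`, `fpl_oracle`, `invert`) are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def invert(m):
    """Gauss-Jordan inverse of a small nonsingular square matrix (list of rows)."""
    n = len(m)
    a = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]


def fpl_oracle(S, c_sum, eps, rng):
    """Stand-in for GEX: follow the perturbed leader over a finite set S."""
    mu = [rng.uniform(0.0, 1.0 / eps) for _ in c_sum]
    return min(S, key=lambda x: dot([c + m for c, m in zip(c_sum, mu)], x))


def bga(S, basis, adversary, T, gamma, eps, seed=0):
    rng = random.Random(seed)
    n = len(basis)
    # The rows of `basis` are the b_i, so `basis` read as a matrix is B^T.
    inv_Bt = invert(basis)
    c_hat_sum = [0.0] * n
    history, total = [], 0.0
    for _ in range(T):
        c = adversary(history)              # c^t = V(h^{t-1})
        if rng.random() < gamma:            # explore
            j = rng.randrange(n)
            x = basis[j]
            z = dot(c, x)                   # the only feedback that is used
            scale = (n / gamma) * z
            # c_hat = (B^T)^{-1} l_hat, where l_hat is `scale` times unit vector j
            c_hat = [inv_Bt[i][j] * scale for i in range(n)]
            c_hat_sum = [s + d for s, d in zip(c_hat_sum, c_hat)]
        else:                               # exploit with the subroutine's choice
            x = fpl_oracle(S, c_hat_sum, eps, rng)
            z = dot(c, x)
        total += z
        history.append(x)
    return total, c_hat_sum
```

Because the estimate is nonzero only on exploration rounds, the subroutine sees a sparse but rescaled cost sequence; the factor `n / gamma` compensates for the probability `gamma / n` with which each basis element is sampled.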

Our sampling basis $B$ is an $n \times n$ matrix with columns $b_i \in S$, so we can write
$x = Bw$ for any $x \in \mathbb{R}^n$ and weights $w \in \mathbb{R}^n$. For a given cost vector $c^t$, let
$\ell^t = B^\dagger c^t$ (the superscript $\dagger$ indicates transpose). This is the vector of decision
costs for the basis decisions, so $\ell^t_i = c^t \cdot b_i$. We define $\hat{\ell}^t$, an estimate of $\ell^t$, as
follows: Let $\hat{\ell}^t = 0 \in \mathbb{R}^n$ on exploitation iterations. If on an exploration iteration
we play $b_j$, then $\hat{\ell}^t$ is the vector where $\hat{\ell}^t_i = 0$ for $i \ne j$ and $\hat{\ell}^t_j = \frac{n}{\gamma}(c^t \cdot b_j)$.
Note that $c^t \cdot b_j$ is the observed quantity, the cost of basis decision $b_j$. On each
iteration, we estimate $c^t$ by $\hat{c}^t = (B^\dagger)^{-1} \hat{\ell}^t$. It is straightforward to show that
$\hat{\ell}^t$ is an unbiased estimate of basis decision costs and that $\hat{c}^t$ is an unbiased
estimate of $c^t$ on each timestep $t$.
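The unbiasedness claim can be checked exactly on a small instance. The sketch below enumerates the possible outcomes of one round (exploit with probability $1-\gamma$, explore basis element $j$ with probability $\gamma/n$) and computes the expectation of the basis-cost estimate with exact rational arithmetic. The helper names and the example basis are hypothetical; unbiasedness of $\hat{c}^t$ then follows because it is a fixed linear map of $\hat{\ell}^t$.

```python
from fractions import Fraction


def basis_costs(B, c):
    """l_i = c . b_i, where B is a list of basis vectors b_i."""
    return [sum(ci * bi for ci, bi in zip(c, b)) for b in B]


def estimate(B, c, gamma, explore_index):
    """One-round estimate l_hat; explore_index is None on exploitation rounds."""
    n = len(B)
    l_hat = [Fraction(0)] * n
    if explore_index is not None:
        observed = sum(ci * bi for ci, bi in zip(c, B[explore_index]))
        l_hat[explore_index] = Fraction(n) / gamma * observed
    return l_hat


def expected_estimate(B, c, gamma):
    """Exact E[l_hat]: exploitation contributes the zero vector."""
    n = len(B)
    expectation = [Fraction(0)] * n
    for j in range(n):
        p = gamma / n                      # probability of exploring with b_j
        l_hat = estimate(B, c, gamma, j)
        expectation = [e + p * v for e, v in zip(expectation, l_hat)]
    return expectation
```

For example, with `B = [[1, 0], [1, 1]]`, `c = [3, -2]`, and `gamma = Fraction(1, 4)`, both `basis_costs` and `expected_estimate` give the vector of true basis costs.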

The choice of the sampling basis plays an important role in the analysis of
our algorithm. In particular, we use a barycentric spanner, introduced in [2]. A
barycentric spanner $B = \{b_1, \ldots, b_n\}$ is a basis for $S$ such that $b_i \in S$ and for
all $x \in S$ we can write $x = Bw$ with coefficients $w_i \in [-1, 1]$. It may not be easy
to find exact barycentric spanners in all cases, but [2] proves they always exist
and gives an algorithm for finding 2-approximate barycentric spanners (where
the weights $w_i \in [-2, 2]$), which is sufficient for our purposes.
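To make the spanner condition concrete, here is a small check for the two-dimensional case with a finite decision set. It is only an illustration: it verifies the coefficient bound for a given basis by Cramer's rule and does not implement the spanner-finding algorithm of [2]. The function names and the example set are hypothetical.

```python
from fractions import Fraction


def coefficients_2d(b1, b2, x):
    """Solve x = w1*b1 + w2*b2 exactly by Cramer's rule (2-D only)."""
    det = Fraction(b1[0] * b2[1] - b2[0] * b1[1])
    if det == 0:
        raise ValueError("b1 and b2 do not form a basis")
    w1 = Fraction(x[0] * b2[1] - b2[0] * x[1]) / det
    w2 = Fraction(b1[0] * x[1] - x[0] * b1[1]) / det
    return w1, w2


def is_spanner(b1, b2, S, C=1):
    """True iff every x in S has coefficients in [-C, C] with respect to (b1, b2)."""
    return all(abs(w) <= C for x in S for w in coefficients_2d(b1, b2, x))


# Hypothetical finite decision set in R^2.
S = [(1, 0), (0, 1), (1, 1), (2, 1)]
```

On this set the basis `(2, 1), (0, 1)` satisfies the condition with `C = 1`, while the standard basis `(1, 0), (0, 1)` only satisfies it with `C = 2`, since `(2, 1)` needs the coefficient 2.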



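Table 1. Summary of notation

  Symbol                                Meaning
  ------------------------------------  ----------------------------------------------------------
  $S \subseteq \mathbb{R}^n$            set of decisions, a compact subset of $\mathbb{R}^n$
  $D \in \mathbb{R}$                    $L_1$ bound on diameter of $S$, $\forall x, y \in S$, $\|x - y\|_1 \le D$
  $n \in \mathbb{N}$                    dimension of decision space
  $h^t$                                 decision history, $h^t = x^1, \ldots, x^t$
  $H^*$                                 set of possible decision histories
  $V : H^* \to \mathbb{R}^n$            adversary, function from decision histories to cost vectors
  $A$                                   an online optimization algorithm
  $G^{t-1}$                             history of BGA randomness for timesteps 1 through $t - 1$
  $c^t \in \mathbb{R}^n$                cost vector on time $t$
  $\hat{c}^t \in \mathbb{R}^n$          BGA's estimate of the cost vector on time $t$
  $M \in \mathbb{R}^+$                  bound on single-iteration cost, $|c^t \cdot x^t| \le M$
  $B \subseteq S$                       sampling basis $B = \{b_1, \ldots, b_n\}$
  $\beta_\infty \in \mathbb{R}$         matrix max norm on $(B^\dagger)^{-1}$
  $\ell^t \in [-M, M]^n$                vector, $\ell^t_i = c^t \cdot b_i$ for $b_i \in B$
  $\hat{\ell}^t \in \mathbb{R}^n$       BGA's estimate of $\ell^t$
  $T \in \mathbb{N}$                    end of time, index of final iteration
  $x^t \in S$                           BGA's decision on time $t$
  $\hat{x}^t \in S$                     decision recommended by GEX on time $t$
  $\chi^t \in \{0, 1\}$                 indicator, $\chi^t = 1$ if BGA explores on $t$, 0 otherwise
  $\gamma \in [0, 1]$                   the probability BGA explores on each timestep
  $z^t \in [-M, M]$                     BGA's loss on iteration $t$, $z^t = c^t \cdot x^t$
  $\hat{z}^t \in [-R, R]$               loss of GEX, $\hat{z}^t = \hat{c}^t \cdot \hat{x}^t$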



4 Analysis 

4.1 Preliminaries 

At each time step, BGA either (with probability $1 - \gamma$) plays the recommendation
$\hat{x}^t$ from GEX, or else (with probability $\gamma$) plays a random basis vector from
$B$. For purposes of analysis, however, it will be convenient to imagine that we
request a recommendation $\hat{x}^t$ from GEX on every iteration, and also that we
randomly pick a basis element to explore, $b^t \in \{b_1, \ldots, b_n\}$, on each iteration. We then
decide to play either $\hat{x}^t$ or $b^t$ based on the outcome $\chi^t$ of a coin of bias $\gamma$.
Thus, the complete history of the algorithm is specified by the algorithm history
$G^{t-1} = [\chi^1, \hat{x}^1, b^1, \ldots, \chi^{t-1}, \hat{x}^{t-1}, b^{t-1}]$, which encodes all previous
random choices. The sample space for all probabilities and expectations is the
set of all possible algorithm histories of length $T$. Thus, for a given adversary $V$,
the various random variables and vectors we consider, such as $x^t$, $c^t$, $\hat{c}^t$, $\hat{x}^t$, and
others, can all be viewed as functions on the set of possible algorithm histories.
Unless otherwise stated, our expectations and probabilities are with respect to
the distribution over these histories.

A partial history $G^t$ can be viewed as a subset of the sample space (an event)
consisting of all complete histories that have $G^t$ as a prefix. We frequently
consider conditional distributions and corresponding expectations with respect
to partial algorithm histories. For instance, if we condition on a history $G^{t-1}$,
the random variables $c^1, \ldots, c^t$, $\hat{\ell}^1, \ldots, \hat{\ell}^{t-1}$, $\hat{c}^1, \ldots, \hat{c}^{t-1}$, $x^1, \ldots, x^{t-1}$,
and $\chi^1, \ldots, \chi^{t-1}$ are fully determined.

We now outline the general structure of our argument. Let $\hat{z}^t = \hat{c}^t \cdot \hat{x}^t$ be
the loss perceived by GEX on iteration $t$. In keeping with earlier definitions,
$\mathrm{loss}(\mathrm{BGA}) = z^{1:T}$ and $\mathrm{loss}(\mathrm{GEX}) = \hat{z}^{1:T}$. We also let $\mathrm{OPT} = \mathrm{OPT}(\mathrm{BGA}, V) =
c^{1:T} \cdot \mathcal{R}(c^{1:T})$, the performance of the best post-hoc decision, and similarly
$\widehat{\mathrm{OPT}} = \mathrm{OPT}(\hat{c}^1, \ldots, \hat{c}^T) = \hat{c}^{1:T} \cdot \mathcal{R}(\hat{c}^{1:T})$.

The base of our analysis is a bound on the loss of GEX with respect to the
cost vectors $\hat{c}^t$ of the form

\[
  E[\mathrm{loss}(\mathrm{GEX})] \le E[\widehat{\mathrm{OPT}}] + (\text{terms}). \tag{2}
\]

Such a result is given in Appendix A, and follows from an adaptation of the
analysis from [1]. We then prove statements having the general form

\[
  E[\mathrm{loss}(\mathrm{BGA})] \le E[\mathrm{loss}(\mathrm{GEX})] + (\text{terms}) \tag{3}
\]

and

\[
  E[\widehat{\mathrm{OPT}}] \le E[\mathrm{OPT}] + (\text{terms}). \tag{4}
\]

These statements connect our real loss to the "imaginary" loss of GEX, and
similarly connect the loss of the best decision in GEX's imagined world with the
loss of the best decision in the real world. Combining the results corresponding
to Equations (2), (3), and (4) leads to an overall bound on the regret of BGA.



4.2 High Probability Bounds on Estimates 



We prove a bound on the accuracy of BGA's estimates $\hat{\ell}^t$, and use this to show
a relationship between $\mathrm{OPT}$ and $\widehat{\mathrm{OPT}}$ of the form in Equation (4).

Define random variables $e^0 = 0$ and $e^t = \ell^t - \hat{\ell}^t$. We are really interested
in the corresponding sums $e^{1:t}$, where $e_i^{1:t}$ is the total error in our estimate of
$c^{1:t} \cdot b_i$. We now bound $|e_i^{1:t}|$.



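Theorem 1. For $\lambda > 0$,

\[
  \Pr\Big[\,|e_i^{1:t}| \ge \lambda \frac{nM}{\gamma} \sqrt{t}\,\Big] \le 2 e^{-\lambda^2/2}.
\]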





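Proof. It is sufficient to show the sequence $e_i^0, e_i^{1:1}, e_i^{1:2}, \ldots, e_i^{1:T}$ of
random variables is a bounded martingale sequence with respect to the filter
$G^0, G^1, \ldots, G^T$; that is, $E[e_i^{1:t} \mid G^{t-1}] = e_i^{1:t-1}$. The result then follows from
Azuma's Inequality (see, for example, [7]).

First, observe that $e_i^{1:t} = \ell_i^t - \hat{\ell}_i^t + e_i^{1:t-1}$. Further, the cost vector $c^t$ is
determined if we know $G^{t-1}$, and so $\ell^t$ is also fixed. Thus, accounting for the $\gamma/n$
probability we explore a particular basis decision $b_i$, we have

\[
  E[e_i^{1:t} \mid G^{t-1}]
    = \frac{\gamma}{n}\Big[\ell_i^t - \frac{n}{\gamma}\ell_i^t\Big]
      + \Big(1 - \frac{\gamma}{n}\Big)\big[\ell_i^t - 0\big] + e_i^{1:t-1}
    = e_i^{1:t-1},
\]

and so we conclude that the $e_i^{1:t}$ form a martingale sequence. Notice that
$|e_i^{1:t} - e_i^{1:t-1}| = |\ell_i^t - \hat{\ell}_i^t|$. If we don't sample, $\hat{\ell}_i^t = 0$ and so
$|e_i^{1:t} - e_i^{1:t-1}| \le M$. If we do sample, we have $\hat{\ell}_i^t = \frac{n}{\gamma}\ell_i^t$ and so
$|e_i^{1:t} - e_i^{1:t-1}| \le \frac{n}{\gamma} M$. This bound is worse, so
it holds in both cases. The result now follows from Azuma's inequality. □

Let $\beta_\infty = \|(B^\dagger)^{-1}\|_\infty$, a matrix $L_\infty$-norm on $(B^\dagger)^{-1}$, so that for any $w$,
$\|(B^\dagger)^{-1} w\|_\infty \le \beta_\infty \|w\|_\infty$.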

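Corollary 1. For $\delta \in (0, 1]$, and all $t$ from 1 to $T$,

\[
  \Pr\big[\,\|\hat{c}^{1:t} - c^{1:t}\|_\infty \ge \beta_\infty J(\delta, \gamma)\sqrt{t}\,\big] \le \delta,
\]

where $J(\delta, \gamma) = \frac{1}{\gamma} n M \sqrt{2\ln(2n/\delta)}$.

Proof. Solving $\delta/n = 2e^{-\lambda^2/2}$ yields $\lambda = \sqrt{2\ln(2n/\delta)}$, and then using this value
in Theorem (1) gives

\[
  \Pr\big[\,|e_i^{1:t}| \ge J(\delta, \gamma)\sqrt{t}\,\big] \le \frac{\delta}{n}
\]

for all $i \in \{1, 2, \ldots, n\}$. Then,

\[
  \Pr\big[\,\|e^{1:t}\|_\infty \ge J(\delta, \gamma)\sqrt{t}\,\big]
    \le \sum_{i=1}^{n} \Pr\big[\,|e_i^{1:t}| \ge J(\delta, \gamma)\sqrt{t}\,\big]
    \le \delta
\]

by the union bound. Now, notice that we can relate $\hat{\ell}^{1:t}$ and $\hat{c}^{1:t}$ by

\[
  \hat{c}^{1:t} = \sum_{\tau=1}^{t} \hat{c}^{\tau}
    = \sum_{\tau=1}^{t} (B^\dagger)^{-1}\hat{\ell}^{\tau}
    = (B^\dagger)^{-1} \sum_{\tau=1}^{t} \hat{\ell}^{\tau}
    = (B^\dagger)^{-1}\hat{\ell}^{1:t},
\]

and similarly for $\ell^{1:t}$ and $c^{1:t}$. Then

\begin{align*}
  \Pr\big[\,\|\hat{c}^{1:t} - c^{1:t}\|_\infty \ge \beta_\infty J(\delta, \gamma)\sqrt{t}\,\big]
    &= \Pr\big[\,\|(B^\dagger)^{-1}(\hat{\ell}^{1:t} - \ell^{1:t})\|_\infty \ge \beta_\infty J(\delta, \gamma)\sqrt{t}\,\big] \\
    &\le \Pr\big[\,\beta_\infty \|e^{1:t}\|_\infty \ge \beta_\infty J(\delta, \gamma)\sqrt{t}\,\big] \\
    &= \Pr\big[\,\|e^{1:t}\|_\infty \ge J(\delta, \gamma)\sqrt{t}\,\big] \\
    &\le \delta.
\end{align*}
□

We can now prove our main result for the section, a statement of the form
of Equation (4) relating $\mathrm{OPT}$ and $\widehat{\mathrm{OPT}}$:

Theorem 2. If we play $V$ against BGA for $T$ timesteps,

\[
  E[\widehat{\mathrm{OPT}}] \le E[\mathrm{OPT}] + (1 - \delta)\,\tfrac{3}{2} D \beta_\infty J(\delta, \gamma)\sqrt{T} + \delta M T.
\]

Proof. Let $\Phi = \hat{c}^{1:T} - c^{1:T}$. By definition of $\mathcal{R}$,
$\mathcal{R}(\hat{c}^{1:T}) \cdot \hat{c}^{1:T} \le \mathcal{R}(c^{1:T}) \cdot \hat{c}^{1:T}$, or
equivalently $\mathcal{R}(c^{1:T} + \Phi) \cdot (c^{1:T} + \Phi) \le \mathcal{R}(c^{1:T}) \cdot (c^{1:T} + \Phi)$, and so by expanding
and rearranging we have

\begin{align*}
  \mathcal{R}(c^{1:T} + \Phi) \cdot c^{1:T} - \mathcal{R}(c^{1:T}) \cdot c^{1:T}
    &\le \big(\mathcal{R}(c^{1:T}) - \mathcal{R}(c^{1:T} + \Phi)\big) \cdot \Phi \\
    &\le D \|\Phi\|_\infty. \tag{5}
\end{align*}

Then,

\begin{align*}
  |\mathrm{OPT} - \widehat{\mathrm{OPT}}|
    &= \big|\mathcal{R}(c^{1:T}) \cdot c^{1:T} - \mathcal{R}(c^{1:T} + \Phi) \cdot (c^{1:T} + \Phi)\big| \\
    &\le \big|\big(\mathcal{R}(c^{1:T}) - \mathcal{R}(c^{1:T} + \Phi)\big) \cdot c^{1:T}\big| + \big|\mathcal{R}(c^{1:T} + \Phi) \cdot \Phi\big| \\
    &\le (D + D/2)\|\Phi\|_\infty,
\end{align*}

where we have used Equation (5). Recall from Section (2), we assume $\|x\|_1 \le
D/2$ for all $x \in S$, so $\|x - y\|_1 \le D$ for all $x, y \in S$. The theorem follows by
applying the bound on $\Phi$ given by Corollary (1), and then observing that the
above relationship holds for at least a $1 - \delta$ fraction of the possible algorithm
histories. For the other $\delta$ fraction, the difference might be as much as $\delta M T$.
Writing the overall expectation as the sum of two expectations conditioned on
whether or not the bound holds gives the result. □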



4.3 Relating the Loss of BGA and Its GEX Subroutine 

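Now we prove a statement like Equation (3), relating loss(BGA) to loss(GEX).

Theorem 3. If we run BGA with parameter $\gamma$ against $V$ for $T$ timesteps,

\[
  E[\mathrm{loss}(\mathrm{BGA})] \le (1 - \gamma) E[\mathrm{loss}(\mathrm{GEX})] + \gamma M T.
\]

Proof. For a given adversary $V$, $G^{t-1}$ fully determines the sequence of cost
vectors given to algorithm GEX. So, we can view GEX as a function from $G^{t-1}$
to probability distributions over $S$. If we present a cost vector $\hat{c}$ to GEX, then
the expected cost to GEX given history $G^{t-1}$ is $\sum_{x \in S} \Pr(x \mid G^{t-1})(\hat{c} \cdot x)$. If
we define $\bar{x}^t = \sum_{x \in S} \Pr(x \mid G^{t-1})\, x$, we can rewrite the expected loss of GEX
against $\hat{c}$ as $\hat{c} \cdot \bar{x}^t$; that is, we can view GEX as incurring the cost of some convex
combination of the possible decisions in expectation. Let $\hat{\ell}^{t,j}$ be $\hat{\ell}^t$ given that we
explore by playing basis vector $b_j$ on time $t$, and similarly let $\hat{c}^{t,j} = (B^\dagger)^{-1}\hat{\ell}^{t,j}$.
Observe that $\hat{\ell}^{t,j}_i = \frac{n}{\gamma}\ell^t_i$ for $j = i$ and 0 otherwise, and so

\[
  \gamma \sum_{j=1}^{n} \frac{1}{n}\,\hat{c}^{t,j} = (B^\dagger)^{-1}\ell^t = c^t. \tag{6}
\]

Now, we can write

\begin{align*}
  E[\hat{z}^t \mid G^{t-1}]
    &= (1 - \gamma)\,0 + \gamma \sum_{j=1}^{n} \frac{1}{n} \sum_{\hat{x}^t \in S} \Pr(\hat{x}^t \mid G^{t-1})\,(\hat{c}^{t,j} \cdot \hat{x}^t) \\
    &= \gamma \Big[\sum_{j=1}^{n} \frac{1}{n}\,\hat{c}^{t,j}\Big] \cdot \bar{x}^t \\
    &= c^t \cdot \bar{x}^t,
\end{align*}

where the last step uses Equation (6). Now, we consider the conditional
expectation of $z^t$ and see that

\begin{align*}
  E[z^t \mid G^{t-1}]
    &= (1 - \gamma)(c^t \cdot \bar{x}^t) + \gamma \sum_{i=1}^{n} \frac{1}{n}(c^t \cdot b_i) \\
    &\le (1 - \gamma) E[\hat{z}^t \mid G^{t-1}] + \gamma M. \tag{7}
\end{align*}

Then we have

\begin{align*}
  E[z^t] &= E\big[E[z^t \mid G^{t-1}]\big] \\
    &\le E\big[(1 - \gamma) E[\hat{z}^t \mid G^{t-1}] + \gamma M\big] \\
    &= (1 - \gamma) E\big[E[\hat{z}^t \mid G^{t-1}]\big] + \gamma M \\
    &= (1 - \gamma) E[\hat{z}^t] + \gamma M, \tag{8}
\end{align*}

by using the inequality from Equation (7). The theorem follows by summing the
inequality (8) over $t$ from 1 to $T$ and applying linearity of expectation. □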



4.4 A Bound on the Expected Regret of BGA 

Theorem 4. If we run BGA with parameter $\gamma$ using subroutine GEX with
parameter $\epsilon$ (as defined in Appendix A), then for all $\delta \in (0, 1]$,

\[
  E[\mathrm{loss}(\mathrm{BGA})]
    \le E[\mathrm{OPT}] + O\Big( D \beta_\infty \frac{1}{\gamma} n M \sqrt{\ln(2n/\delta)}\,\sqrt{T}
      + \delta M T + \frac{n^3 M^2 \epsilon T}{\gamma^2} + \frac{n}{\epsilon} + \gamma M T \Big).
\]

Proof. In Appendix A, we show an algorithm to plug in for GEX, based on
the FPL algorithm of [1], and give bounds on regret against a deterministic
adaptive adversary. We first show how to apply that analysis to GEX running
as a subroutine to BGA.

First, we need to bound $|\hat{c}^t \cdot x|$. By definition, for any $x \in S$, we can write
$x = Bw$ for weights $w$ with $w_i \in [-1, 1]$ (or $[-2, 2]$ if it is an approximate
barycentric spanner). Note that $\|\hat{\ell}^t\|_1 \le (\frac{n}{\gamma}) M$, and for any $x \in S$, we can write
$x$ as $Bw$ where $w_i \in [-2, 2]$. Thus,

\[
  |\hat{c}^t \cdot x| = |(B^\dagger)^{-1}\hat{\ell}^t \cdot Bw| = |(\hat{\ell}^t)^\dagger B^{-1} B w|
    = |\hat{\ell}^t \cdot w| \le \|\hat{\ell}^t\|_1 \|w\|_\infty \le \frac{2nM}{\gamma}.
\]



Let $R = 2nM/\gamma$. Suppose at the beginning of time we fix the random
decisions of BGA that are not made by GEX, that is, we fix a sequence $X =
[\chi^1, b^1, \ldots, \chi^T, b^T]$. Fixing this randomness together with $V$ determines a new
deterministic adaptive adversary $\hat{V}$ that GEX is effectively playing against. To
see this, let $\hat{h}^{t-1} = [\hat{x}^1, \ldots, \hat{x}^{t-1}]$. If we combine $\hat{h}^{t-1}$ with the information in
$X$, it fully determines a partial history $G^{t-1}$. If we let $h^{t-1} = [x^1, \ldots, x^{t-1}]$ be
the partial decision history that can be recovered from $G^{t-1}$, then $\hat{V}(\hat{h}^{t-1}) = \hat{c}^t$,
the estimate that BGA forms from $V(h^{t-1})$ and the fixed choices $\chi^t$ and $b^t$.
Thus, when GEX is run as a subroutine of BGA, we can apply
Lemma (2) from the Appendix and conclude

\[
  E[\mathrm{loss}(\mathrm{GEX}) \mid X] \le E[\widehat{\mathrm{OPT}} \mid X] + \epsilon(4n + 2)R^2 T + \frac{4n}{\epsilon}. \tag{9}
\]

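For the remainder of this proof, we use big-Oh notation to simplify the
presentation. Now, taking the expectation of both sides of Equation (9),

\[
  E[\mathrm{loss}(\mathrm{GEX})] \le E[\widehat{\mathrm{OPT}}] + O\Big(\epsilon n R^2 T + \frac{n}{\epsilon}\Big).
\]

Applying Theorem (3),

\[
  E[\mathrm{loss}(\mathrm{BGA})] \le (1 - \gamma) E[\widehat{\mathrm{OPT}}] + O\Big(\epsilon n R^2 T + \frac{n}{\epsilon} + \gamma M T\Big),
\]

and then using Theorem (2) we have

\begin{align*}
  E[\mathrm{loss}(\mathrm{BGA})]
    &\le (1 - \gamma) E[\mathrm{OPT}] + O\Big(\beta_\infty J(\delta, \gamma) D \sqrt{T} + \delta M T + \epsilon n R^2 T + \frac{n}{\epsilon} + \gamma M T\Big) \\
    &\le E[\mathrm{OPT}] + O\Big( D \beta_\infty \frac{1}{\gamma} n M \sqrt{\ln(2n/\delta)}\,\sqrt{T}
      + \delta M T + \frac{n^3 M^2 \epsilon T}{\gamma^2} + \frac{n}{\epsilon} + \gamma M T \Big).
\end{align*}

For the last line, note that while $E[\mathrm{OPT}]$ could be negative, it is still bounded by
$MT$, and so this just adds another $\gamma M T$ term, which is captured in the big-Oh
term. □

Ignoring the dependence on $n$, $M$, and $D$ and simplifying, we see BGA's
expected regret is bounded by

\[
  E[\mathrm{regret}(\mathrm{BGA})] = O\Big( \frac{\sqrt{T \ln(1/\delta)}}{\gamma} + \delta T + \frac{\epsilon T}{\gamma^2} + \frac{1}{\epsilon} + \gamma T \Big).
\]

Setting $\gamma = \delta = T^{-1/4}$ and $\epsilon = T^{-3/4}$, we get a bound on our loss of order
$O(T^{3/4}\sqrt{\ln T})$.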
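As a check on this parameter choice, the following worked substitution assumes the simplified bound has the five terms $\sqrt{T\ln(1/\delta)}/\gamma$, $\delta T$, $\epsilon T/\gamma^2$, $1/\epsilon$, and $\gamma T$ (the form obtained from Theorem 4 after dropping the dependence on $n$, $M$, and $D$), with $\gamma = \delta = T^{-1/4}$ and $\epsilon = T^{-3/4}$:

```latex
\begin{align*}
  \frac{\sqrt{T \ln(1/\delta)}}{\gamma}
      &= T^{1/4} \cdot T^{1/2} \sqrt{\tfrac{1}{4}\ln T}
       = \tfrac{1}{2}\, T^{3/4} \sqrt{\ln T}, \\
  \delta T &= T^{-1/4} \cdot T = T^{3/4}, \\
  \frac{\epsilon T}{\gamma^2} &= T^{-3/4} \cdot T \cdot T^{1/2} = T^{3/4}, \\
  \frac{1}{\epsilon} &= T^{3/4}, \\
  \gamma T &= T^{-1/4} \cdot T = T^{3/4}.
\end{align*}
```

Every term is at most of order $T^{3/4}\sqrt{\ln T}$, and the first term is the only one that carries the logarithmic factor.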



5 Conclusions and Open Problems 

We have presented a general algorithm for online optimization over an arbitrary
set of decisions $S \subseteq \mathbb{R}^n$, and proved regret bounds for our algorithm that hold
against an adaptive adversary.

A number of questions are raised by this work. In the "flat" bandits
problem, bounds of the form $O(\sqrt{T})$ are possible against an adaptive adversary [4].
Against an oblivious adversary in the geometric case, a bound of $O(T^{2/3})$ is
achieved in [2]. We achieve a bound of $O(T^{3/4}\sqrt{\ln T})$ for this problem against
an adaptive adversary. In [4], lower bounds are given showing that the $O(\sqrt{T})$
result is tight, but no such bounds are known for the geometric decision-space
problem. Can the $O(T^{3/4}\sqrt{\ln T})$ and possibly the $O(T^{2/3})$ bounds be tightened
to $O(\sqrt{T})$? A related issue is the use of information received by the algorithm;
our algorithm and the algorithm of [2] only use a $\gamma$ fraction of the feedback they
receive, which is intuitively unappealing. It seems plausible that an algorithm
can be found that uses all of the feedback, possibly achieving tighter bounds.



A Specification of a Geometric Experts Algorithm

In this section we point out how the FPL algorithm and analysis of [1] can be
adapted to our setting to use as our GEX subroutine, and prove the corresponding
bound needed for Theorem (4). In particular, we need a bound for an arbitrary
$S \subseteq \mathbb{R}^n$ and arbitrary cost vectors, requiring only that on each timestep,
$|c \cdot x| \le R$. Further, the bound must hold against an adaptive adversary.

FPL solves the online optimization problem when the entire cost vector $c^t$
is observed at each timestep. It maintains the sum $c^{1:t-1}$ and on each timestep
plays decision $x^t = \mathcal{R}(c^{1:t-1} + \mu)$, where $\mu$ is chosen uniformly at random
from $[0, 1/\epsilon]^n$, given $\epsilon$, a parameter of the algorithm. The analysis of FPL in [1]
assumes positive cost vectors $c$ satisfying $\|c\|_1 \le A$, and positive decision vectors
from $S \subseteq \mathbb{R}^n$ with $\|x - y\|_1 \le D$ for all $x, y \in S$ and $|c \cdot x - c \cdot y| \le R$ for
all cost vectors $c$ and $x, y \in S$. Further, the bounds proved are with respect to
a fixed series of cost vectors, not an adaptive adversary. We now show how to
bridge the gap from these assumptions to our assumptions.
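The FPL play rule described above can be sketched for a finite decision set as follows. This is an illustrative sketch, not the implementation analyzed in [1]: the offline oracle is a brute-force minimum over a list, and the names `offline_oracle`, `fpl_play`, and `run_fpl` are hypothetical.

```python
import random


def offline_oracle(S, c):
    """R(c) = argmin over x in S of c . x, for a finite decision set S."""
    return min(S, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))


def fpl_play(S, cost_sum, eps, rng):
    """One FPL decision: perturb the running cost sum, then call the oracle."""
    mu = [rng.uniform(0.0, 1.0 / eps) for _ in cost_sum]
    return offline_oracle(S, [c + m for c, m in zip(cost_sum, mu)])


def run_fpl(S, costs, eps, seed=0):
    """Play FPL against a fixed (oblivious) cost sequence and return its loss."""
    rng = random.Random(seed)
    cost_sum = [0.0] * len(costs[0])
    loss = 0.0
    for c in costs:
        x = fpl_play(S, cost_sum, eps, rng)      # depends on c^{1:t-1} only
        loss += sum(ci * xi for ci, xi in zip(c, x))
        cost_sum = [s + ci for s, ci in zip(cost_sum, c)]  # full vector observed
    return loss
```

In the full-observation setting the whole vector `c` is added to the running sum after every round, which is exactly the information the bandit algorithm has to replace with its estimates.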

First, we adapt an argument from [2], showing that by using our barycentric
spanner basis, we can transform our problem into one where the assumptions
of FPL are met. We then argue that a corresponding bound holds against an
adaptive adversary.

Lemma 1. Let $S \subseteq \mathbb{R}^n$ be a set of (not necessarily positive) decisions, and
$k^T = [c^1, \ldots, c^T]$ a set of cost vectors on those decisions, such that $|c^t \cdot x| \le R$
for all $x \in S$ and $c^t \in k^T$. Then, there is an algorithm $A(\epsilon)$ that achieves

\[
  E[\mathrm{loss}(A(\epsilon), k^T)] \le \mathrm{OPT}(k^T) + \epsilon(4n + 2)R^2 T + \frac{4n}{\epsilon}.
\]

Proof. This is an adaptation of the arguments of Appendix A of [2]. Fix a
barycentric spanner $B = \{b_1, \ldots, b_n\}$ for $S$. Then, for each $x \in S$, let $x = Bw$ and
define $f(x) = [-\sum_{i=1}^{n} w_i, w_1, \ldots, w_n]$. Let $f(S) = S'$. For each cost vector $c^t$
define $g(c^t) = [R, R + c^t \cdot b_1, \ldots, R + c^t \cdot b_n]$. It is straightforward to verify
that $c^t \cdot x = g(c^t) \cdot f(x)$, and further $g(c^t) \ge 0$, $\|g(c^t)\|_1 \le (2n + 1)R$, and the
difference in cost of any two decisions against a fixed $g(c^t)$ is at most $2R$. By
definition of a barycentric spanner, $w_i \in [-1, 1]$ and so the $L_1$ diameter of $S'$ is
at most $4n$. Note the assumption of positive decision vectors in Theorem 1 of [1]
can easily be lifted by additively shifting the space of decision vectors until it
is positive. This changes the loss of the algorithm and of the best decision by
the same amount, so additive regret bounds are unchanged. The result of this
lemma then follows from the bound of Theorem 1 from [1]. □

Now, we extend the above bound to play against an adaptive adversary.
While we specialize the result to the particular algorithm implied by Lemma (1),
the argument is in fact more general and can be extended to all self-oblivious
algorithms, that is, algorithms whose play depends only on the cost history [8].

Lemma 2. Let $S \subseteq \mathbb{R}^n$ be a set of (not necessarily positive) decisions, and $V$
be an adaptive adversary such that $|c^t \cdot x| \le R$ for all $x \in S$ and any $c^t$ produced
by the adversary. Then, if we run $A(\epsilon)$ from Lemma (1) against this adversary,

\[
  E[\mathrm{loss}(A(\epsilon), V)] \le E[\mathrm{OPT}(A(\epsilon), V)] + \epsilon(4n + 2)R^2 T + \frac{4n}{\epsilon}.
\]

Proof. Fixing $V$ also determines a distribution over decision/cost histories. Our
expectations for this Lemma are with respect to this distribution. Let $k^T =
[c^1, \ldots, c^T]$, and let $k^t$ be the first $t$ costs in $k^T$. Note that $A(\epsilon)$ is self-oblivious,
so $\bar{x}^t = \sum_{x \in S} \Pr(x \mid k^{t-1})\, x$ is well defined. Adopting our earlier notation, let
$z^t$ be our loss on time $t$, so $\mathrm{loss}(A(\epsilon)) = z^{1:T}$. Then,

\[
  E\Big[\sum_{t=1}^{T} z^t \,\Big|\, k^T\Big] = \sum_{t=1}^{T} E[z^t \mid k^T] = \sum_{t=1}^{T} c^t \cdot \bar{x}^t.
\]

Now, consider the oblivious adversary that plays the fixed sequence of cost
vectors $k^T = [c^1, \ldots, c^T]$. It is easy to see the expected loss to FPL against this
adversary is also $\sum_{t=1}^{T} c^t \cdot \bar{x}^t$, and so the performance bound from Lemma (1)
applies. The result follows by writing $E[\mathrm{loss}(A(\epsilon), V)] = E[\,E[\mathrm{loss}(A(\epsilon), V) \mid k^T]\,]$,
and applying that bound to the inner expectation. □

Thus, we can use $A(\epsilon)$ as our GEX subroutine for full-observation online
geometric optimization.
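The change of variables used in the proof of Lemma 1 can be checked numerically. The sketch below assumes the lifted decision is $f(x) = [-\sum_i w_i, w_1, \ldots, w_n]$ for $x = Bw$ and the lifted cost is $g(c) = [R, R + c \cdot b_1, \ldots, R + c \cdot b_n]$, and verifies that costs are preserved; the function names and the example basis are hypothetical.

```python
def f(w):
    """Lifted decision f(x) for x = Bw: prepend minus the sum of the weights."""
    return [-sum(w)] + list(w)


def g(c, basis, R):
    """Lifted cost g(c): every entry is shifted by R so that it is nonnegative."""
    return [R] + [R + sum(ci * bi for ci, bi in zip(c, b)) for b in basis]


def combine(basis, w):
    """x = Bw, where the columns of B are the basis vectors."""
    n = len(basis[0])
    return [sum(wj * b[i] for wj, b in zip(w, basis)) for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

For instance, with basis vectors `(2, 1)` and `(0, 1)`, cost `c = (3, -2)`, and `R = 6`, the weights `w = (1, -1)` give `x = (2, 0)`, and both `dot(c, x)` and `dot(g(c, basis, R), f(w))` evaluate to 6. The shift by `R` cancels because the first coordinate of `f` is minus the sum of the remaining ones.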



B Notions of Regret 



In [5], an alternative definition of regret is given, namely,

\[
  E[\mathrm{loss}_{V,A}(h^T)] - \min_{x \in S} E\Big[\sum_{t=1}^{T} c^t \cdot x\Big]. \tag{10}
\]

This definition is equivalent to ours in the case of an oblivious adversary, but
against an adaptive adversary the "best decision" for this definition is not the
best decision for a particular decision history, but the best decision if the decision
must be chosen before a cost history is selected according to the distribution over
such histories. In particular,

\[
  E\Big[\min_{x \in S} \sum_{t=1}^{T} c^t \cdot x\Big] \le \min_{x \in S} E\Big[\sum_{t=1}^{T} c^t \cdot x\Big],
\]

and so a bound on Equation (1) is at least as strong as a bound on Equation (10).
In fact, bounds on Equation (10) can be very poor when the adversary is
adaptive. There are natural examples where the stronger definition (1) gives regret
$O(T)$ while the weaker definition (10) indicates no regret. Adapting an example
from [5], let $S = \{e_1, \ldots, e_n\}$ (the "flat" bandit setting) and consider the
algorithm $A$ that plays uniformly at random from $S$. The adversary $V$ gives $c^1 = 0$,
and if $A$ then plays $e_i$ on the first iteration, thereafter the adversary plays the
cost vector $c^t$ where $c^t_i = 0$ and $c^t_j = 1$ for $j \ne i$. The expected loss of $A$ is
$\frac{n-1}{n}(T - 1)$. For regret as defined by Equation (10), $\min_{x \in S} E[c^{1:T} \cdot x] = \frac{n-1}{n}(T - 1)$,
indicating no regret, while $E[\min_{x \in S}(c^{1:T} \cdot x)] = 0$, and so the stronger definition indicates
$O(T)$ regret.
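The gap between the two definitions in this example can be computed exactly. The sketch below assumes the setup just described (decisions $e_1, \ldots, e_n$, a uniformly random player, zero cost in round 1, and cost 1 on every coordinate except the one played first in rounds 2 through $T$); the function name is hypothetical.

```python
from fractions import Fraction


def example_regrets(n, T):
    """Return (regret under definition (10), regret under definition (1))."""
    later = T - 1                          # rounds 2..T carry nonzero cost
    # A plays uniformly, so in each later round it pays 1 with prob. (n-1)/n.
    expected_loss = Fraction(n - 1, n) * later
    # Benchmark of (10): fix e_k in advance; the first play i differs from k
    # with probability (n-1)/n, in which case e_k pays 1 in every later round.
    weak_benchmark = min(
        sum(Fraction(1, n) * (later if i != k else 0) for i in range(n))
        for k in range(n)
    )
    # Benchmark of (1): after seeing the history, choose e_i at total cost 0.
    strong_benchmark = sum(
        Fraction(1, n) * min(later if i != k else 0 for k in range(n))
        for i in range(n)
    )
    return expected_loss - weak_benchmark, expected_loss - strong_benchmark
```

For `n = 4` and `T = 101` the expected loss is 75; definition (10) reports regret 0 while definition (1) reports regret 75, which grows linearly in $T$.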

Unfortunately, this implies that proof techniques for bounds on expected
weak regret like those in [4] and [2] cannot be used to get bounds on regret as
defined by Equation (1). The problem is that even if we have unbiased estimates
of the costs, these cannot be used to evaluate the term $E[\min_{x \in S}(c^{1:T} \cdot x)]$
in (1) because min is a non-linear operator. We surmount this problem by proving
high-probability bounds on our estimates of $c^t$, which allows us to use a union
bound to evaluate the expectation over the min operator. Note that the high
probability bounds proved in [4] and [2] can be seen as corresponding to our
definition of expected regret.



Abstract. Probabilistic finite automata (PFA) model stochastic
languages, i.e. probability distributions over strings. Inferring PFA from
stochastic data is an open field of research. We show that PFA are
identifiable in the limit with probability one. Multiplicity automata (MA)
are another device to represent stochastic languages. We show that a
MA may generate a stochastic language that cannot be generated by
a PFA, but we show also that it is undecidable whether a MA
generates a stochastic language. Finally, we propose a learning algorithm for
a subclass of PFA, called PRFA.



1 Introduction 

Probabilistic automata (PFA) are formal objects which model stochastic lan- 
guages, i.e. probability distributions over words [1]. They are composed of a 
structure which is a finite automaton (NFA) and of parameters associated with 
states and transitions which represent the probability for a state to be initial, 
terminal or the probability for a transition to be chosen. Given the structure of 
a probabilistic automaton $A$ and a sequence of words $u_1, \ldots, u_n$ independently
distributed according to a probability distribution $P$, computing parameters for
$A$ which maximize the likelihood of the observation is NP-hard [2]. However in
practical cases, algorithms based on the EM (Expectation-Maximization) method
[3] can be used to compute approximate values. On the other hand, inferring a 
probabilistic automaton (structure and parameters) from a sequence of words is 
a widely open field of research. In some applications, prior knowledge may help 
to choose a structure (for example, the standard model for biological sequence 
analysis [4]). Without prior knowledge, a complete graph structure can be cho- 
sen. But it is likely that in general, inferring both the appropriate structure and 
parameters from data would provide better results (see for example [5]). 

Several learning frameworks can be considered to study inference of PFA.
They often consist in adaptations to the stochastic case of classical learning
models. We consider a variant of the identification in the limit model of Gold
[6], adapted to the stochastic case in [7]. Given a PFA $A$ and a sequence of
words $u_1, \ldots, u_n, \ldots$ independently drawn according to the associated
distribution $P_A$, an inference algorithm must compute a PFA $A_n$ from each subsequence
$u_1, \ldots, u_n$ such that with probability one, the support of $A_n$ is stationary from
some index $n$ and $P_{A_n}$ converges to $P_A$; moreover, when parameters of the
target are rational numbers, it can be requested that $A_n$ itself is stationary from
some index. The set of probabilistic automata whose structure is deterministic
(PDFA) is identifiable in the limit with probability one [8,9,10], the identification
being exact when the parameters of the target are rational numbers. However,
PDFA are far less expressive than PFA, i.e. the set of probability distributions
associated with PDFA is strictly included in the set of distributions generated
from general PFA. We show that PFA are identifiable in the limit, with exact
identification when the parameters of the target are rational numbers (Section 3).

Multiplicity automata (MA) are devices which model functions from $\Sigma^*$ to
$\mathbb{R}$. It has been shown that functions that can be computed by MA are very
efficiently learnable in a variant of the exact learning model of Angluin, where
the learner can ask equivalence and extended membership queries [11,12,13]. As
PFA are particular MA, they are learnable in this model. However, the learning
is improper in the sense that the output function is not a PFA but a multiplicity
automaton. We show that a MA may not be a very convenient representation
scheme to represent a PFA if the goal is to learn it from stochastic data. This
representation is not robust, i.e. there are MA which do not compute a stochastic
language and which are arbitrarily close to a given PFA. Moreover, we show that
it is undecidable whether a MA generates a stochastic language. That is, given
a MA computed from stochastic data, it is possible that it does not compute a
stochastic language and there may be no way to detect it! We also show that MA
can compute stochastic languages that cannot be computed by PFA. These two
results are proved in Section 4: they solve problems that were left open in [1].

Our identification in the limit algorithm of PFA is far from being efficient
while algorithms that identify PDFA in the limit can also be used in
practical learning situations (ALERGIA [8], RLIPS [9], MDI [14]). Note also that
we do not have a model that describes algorithms "that can be used in
practical cases": the identification in the limit model is clearly too weak, exact
learning via queries is unrealistic, and the PAC-model may be too strong (PDFA are not
PAC-learnable [15]). So, it is important to define subclasses of PFA, as rich as
possible, while keeping good empirical learnability properties. We have
introduced in [16,17] a new class of PFA based on the notion of residual languages:
a residual language of a stochastic language $P$ is the language $u^{-1}P$ defined by
$u^{-1}P(v) = P(uv)/P(u\Sigma^*)$. It can be shown that a stochastic language can be
generated by a PDFA iff it has a finite number of residual languages. We consider
the class of Probabilistic Residual Finite Automata (PRFA): a PFA $A$ is a PRFA
iff each of its states generates a residual language of $P_A$. It can be shown that
a stochastic language $P$ can be generated by a PRFA iff $P$ has a finite number
of prime residual languages $u_1^{-1}P, \ldots, u_n^{-1}P$ sufficient to express all the
residual languages as a convex linear combination of $u_1^{-1}P, \ldots, u_n^{-1}P$, i.e. for every
word $v$, there exist non negative real numbers $\alpha_i$ such that $v^{-1}P = \sum_i \alpha_i u_i^{-1}P$
([17,16]). Clearly, the class of PRFA is much more expressive than PDFA. We
introduce a first learning algorithm for PRFA, which identifies this class in the
limit with probability one, and can be used in practical cases (Section 5).
2 Preliminaries

2.1 Automata and Languages 

Let $\Sigma$ be a finite alphabet, and $\Sigma^*$ be the set of words on
$\Sigma$. The empty word is denoted by $\varepsilon$ and the length of a word $u$
is denoted by $|u|$. Let $<$ denote the length-lexicographic order on $\Sigma^*$.
A language is a subset of $\Sigma^*$. For any language $L$, let
$\mathrm{pref}(L) = \{u \in \Sigma^* \mid \exists v \in \Sigma^*, uv \in L\}$.
$L$ is prefixial iff $L = \mathrm{pref}(L)$.

A non deterministic finite automaton (NFA) is a 5-tuple
$A = \langle \Sigma, Q, Q_0, F, \delta \rangle$ where $Q$ is a finite set of
states, $Q_0 \subseteq Q$ is the set of initial states, $F \subseteq Q$ is the
set of terminal states, and $\delta$ is the transition function defined from
$Q \times \Sigma$ to $2^Q$. Let $\delta$ also denote the extension of the
transition function defined from $2^Q \times \Sigma^*$ to $2^Q$. An NFA is
deterministic (DFA) if $|Q_0| = 1$ and if $\forall q \in Q, \forall x \in \Sigma$,
$|\delta(q, x)| \leq 1$. An NFA is trimmed if for any state $q$,
$q \in \delta(Q_0, \Sigma^*)$ and $\delta(q, \Sigma^*) \cap F \neq \emptyset$.
Let $A = \langle \Sigma, Q, Q_0, F, \delta \rangle$ be an NFA. A word
$u \in \Sigma^*$ is recognized by $A$ if $\delta(Q_0, u) \cap F \neq \emptyset$.
The language recognized by $A$ is
$L_A = \{u \in \Sigma^* \mid \delta(Q_0, u) \cap F \neq \emptyset\}$.



2.2 Multiplicity and Probabilistic Automata, Stochastic Languages 

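A multiplicity automaton (MA) is a 5-tuple
$\langle \Sigma, Q, \varphi, \iota, \tau \rangle$ where $Q$ is a finite set of
states, $\varphi : Q \times \Sigma \times Q \to \mathbb{R}$ is the transition
function, $\iota : Q \to \mathbb{R}$ is the initialization function and
$\tau : Q \to \mathbb{R}$ is the termination function. We extend the transition
function $\varphi$ to $Q \times \Sigma^* \times Q$ by
$\varphi(q, wx, r) = \sum_{s \in Q} \varphi(q, w, s)\varphi(s, x, r)$
where $x \in \Sigma$, and $\varphi(q, \varepsilon, r) = 1$ if $q = r$ and 0
otherwise. We extend again $\varphi$ to $Q \times 2^{\Sigma^*} \times 2^Q$ by
$\varphi(q, U, R) = \sum_{w \in U} \sum_{r \in R} \varphi(q, w, r)$. Let
$A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$ be a MA. Let $P_A$ be the
function defined by:
$P_A(u) = \sum_{q \in Q} \sum_{r \in Q} \iota(q)\varphi(q, u, r)\tau(r)$.

The support of $A$ is the NFA $\langle \Sigma, Q, Q_I, Q_T, \delta \rangle$ where
$Q_I = \{q \in Q \mid \iota(q) \neq 0\}$,
$Q_T = \{q \in Q \mid \tau(q) \neq 0\}$ and
$\delta(q, x) = \{r \in Q \mid \varphi(q, x, r) \neq 0\}$ for any state $q$
and letter $x$. An MA is said to be trimmed if its support is a trimmed NFA.

A semi-PFA is a MA such that $\iota$, $\varphi$ and $\tau$ take their values in
$[0,1]$, $\sum_{q \in Q} \iota(q) \leq 1$ and for any state $q$,
$\tau(q) + \varphi(q, \Sigma, Q) \leq 1$. A Probabilistic Finite Automaton (PFA)
is a trimmed semi-PFA such that $\sum_{q \in Q} \iota(q) = 1$ and for any
state $q$, $\tau(q) + \varphi(q, \Sigma, Q) = 1$. A Probabilistic Deterministic
Finite Automaton (PDFA) is a PFA whose support is deterministic.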
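The definition of $P_A$ can be illustrated with a small sketch. The two-state automaton below is a hypothetical toy example (it does not come from the paper), and the helper names `word_value` and `satisfies_pfa_equalities` are ours. The second helper only checks the two normalisation equalities of a PFA; it does not check non-negativity or that the support is trimmed.

```python
from fractions import Fraction as F

def word_value(iota, phi, tau, word):
    """P_A(word) = sum over q, r of iota(q) * phi(q, word, r) * tau(r)."""
    states = list(iota)
    weights = dict(iota)  # weights[q]: mass reaching q after the prefix read so far
    for x in word:
        nxt = {r: F(0) for r in states}
        for q in states:
            for r in states:
                nxt[r] += weights[q] * phi.get((q, x, r), F(0))
        weights = nxt
    return sum(weights[r] * tau[r] for r in states)

def satisfies_pfa_equalities(iota, phi, tau, alphabet):
    """Check sum_q iota(q) = 1 and tau(q) + phi(q, Sigma, Q) = 1 for every q."""
    states = list(iota)
    if sum(iota.values()) != 1:
        return False
    for q in states:
        out = sum(phi.get((q, x, r), F(0)) for x in alphabet for r in states)
        if tau[q] + out != 1:
            return False
    return True

# Hypothetical toy PFA over {a, b} with states 0 and 1 (an assumption for illustration).
iota = {0: F(1), 1: F(0)}
tau = {0: F(0), 1: F(3, 4)}
phi = {(0, "a", 0): F(1, 2), (0, "b", 1): F(1, 2), (1, "a", 1): F(1, 4)}

p_b = word_value(iota, phi, tau, "b")    # iota(0) * phi(0,b,1) * tau(1)
p_ab = word_value(iota, phi, tau, "ab")
```

Exact rationals are used so that the normalisation equalities can be tested without rounding error.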

A stochastic language on $\Sigma$ is a probability distribution over $\Sigma^*$,
i.e. a function $P$ defined from $\Sigma^*$ to $[0,1]$ such that
$\sum_{u \in \Sigma^*} P(u) = 1$. The function $P_A$ associated with a PFA $A$
is a stochastic language. Let us denote by $\mathcal{S}$ the set of stochastic
languages on $\Sigma$. Let $P \in \mathcal{S}$ and let
$\mathrm{res}(P) = \{u \in \Sigma^* \mid P(u\Sigma^*) \neq 0\}$. Let
$u \in \mathrm{res}(P)$; the residual language of $P$ associated with $u$ is the
stochastic language $u^{-1}P$ defined by $u^{-1}P(w) = P(uw)/P(u\Sigma^*)$. Let
$\mathrm{Res}(P) = \{u^{-1}P \mid u \in \mathrm{res}(P)\}$. It can easily be
shown that $\mathrm{Res}(P)$ spans a finite dimensional vector space iff $P$ can
be generated by a MA. Let $MA_{\mathcal{S}}$ be the set composed of MA which
generate stochastic languages. Let us denote by $\mathcal{S}_{MA}$ (resp.
$\mathcal{S}_{PFA}$, $\mathcal{S}_{PDFA}$) the set of stochastic languages
generated by MA (resp. PFA, PDFA) on $\Sigma$. Let $R \subseteq MA$. Let us
denote by $R[\mathbb{Q}]$ the set of elements of $R$, the parameters of which
are all in $\mathbb{Q}$.
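As an illustration of the residual language $u^{-1}P(w) = P(uw)/P(u\Sigma^*)$, here is a minimal sketch for a distribution with finite support stored as a dictionary. The distribution `P` and the function names are hypothetical and chosen only for the example.

```python
from fractions import Fraction

def prefix_mass(P, u):
    """P(u Sigma^*): total probability of the words that start with u."""
    return sum(p for w, p in P.items() if w.startswith(u))

def residual(P, u):
    """Return u^{-1}P as a dict, or None when u is not in res(P)."""
    mass = prefix_mass(P, u)
    if mass == 0:
        return None
    # Strip the prefix u and renormalise by P(u Sigma^*).
    return {w[len(u):]: p / mass for w, p in P.items() if w.startswith(u)}

# Hypothetical toy stochastic language with finite support (not from the paper).
P = {"a": Fraction(1, 2), "ab": Fraction(1, 4), "b": Fraction(1, 4)}
a_residual = residual(P, "a")
```

The residual is again a probability distribution, which is why it is itself a stochastic language.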
2.3 Learning Stochastic Languages

We are interested in learnable subsets of MA which generate stochastic
languages. Several learning models can be used; we consider two of them.

Identification in the limit with probability 1. The identification in the 
limit learning model of Gold [6] can be adapted to the stochastic case ([7]). 

Let $P \in \mathcal{S}$ and let $S$ be a finite sample drawn according to $P$.
For any $X \subseteq \Sigma^*$, let
$P_S(X) = \frac{1}{\mathrm{card}(S)} \sum_{x \in S} 1_{x \in X}$ be the
empirical distribution associated with $S$. A complete presentation of $P$ is an
infinite sequence $S$ of words generated according to $P$. Let $S_n$ be the
sequence composed of the $n$ first words (not necessarily different) of $S$. We
shall write $P_n(X)$ instead of $P_{S_n}(X)$.
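The empirical distribution can be sketched in a few lines. The sample below is a hypothetical toy sample, and representing the set $X$ by a membership predicate `in_X` is a choice made here for convenience, not something prescribed by the paper.

```python
from fractions import Fraction

def empirical_prob(sample, in_X):
    """P_S(X): fraction of the words of the sample (with repetitions) lying in X."""
    hits = sum(1 for w in sample if in_X(w))
    return Fraction(hits, len(sample))

def P_n(presentation, n, in_X):
    """Empirical probability computed on the first n words of a presentation."""
    return empirical_prob(presentation[:n], in_X)

# Hypothetical sample; words are plain Python strings.
S = ["a", "ab", "ab", "b"]
p_word_ab = empirical_prob(S, lambda w: w == "ab")             # X = {ab}
p_prefix_a = empirical_prob(S, lambda w: w.startswith("a"))    # X = a Sigma^*
```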

Definition 1. Let $\mathcal{R} \subseteq MA_{\mathcal{S}}$. $\mathcal{R}$ is said
to be identifiable in the limit with probability one if there exists a learning
algorithm $\mathcal{L}$ such that for any $R \in \mathcal{R}$, with probability
1, for any complete presentation $S$ of $P_R$, $\mathcal{L}$ computes for each
$S_n$ given as input a hypothesis $R_n$ such that the support of $R_n$ is
stationary from some index $n^*$ and such that $P_{R_n} \to P_R$ as
$n \to \infty$. Moreover, $\mathcal{R}$ is strongly identifiable in the limit
with probability one if $P_{R_n}$ is also stationary from some index.

It has been shown that PDFA is identifiable in the limit with probability one
[8, 9] and that PDFA[$\mathbb{Q}$] is strongly identifiable in the limit [10].

We show below that PFA is identifiable in the limit with probability one and
that PFA[$\mathbb{Q}$] is strongly identifiable in the limit.

Learning using queries. The MAT model of Angluin [18], which allows the use of
membership queries (MQ) and equivalence queries (EQ), has been extended to
functions computed by MA. Let $P$ be the target function, let $u$ be a word and
let $A$ be a MA. The answer to the query MQ($u$) is the value $P(u)$; the answer
to the query EQ($A$) is YES if $P_A = P$ and NO otherwise. Functions computed
by MA can be learned exactly within polynomial time provided that the learning
algorithm can make extended membership queries and equivalence queries.
Therefore, any stochastic language in $\mathcal{S}_{MA}$ can be learned by this
algorithm.

However, using MA to represent stochastic languages has some drawbacks:
first, this representation is not robust, i.e. a MA may compute a stochastic
language for a given set of parameters $\theta_0$ and compute a function which
is not a stochastic language for any $\theta \neq \theta_0$; moreover, it is
undecidable whether a MA computes a stochastic language. That is, by using MA to
represent stochastic languages, a learning algorithm using approximate data
might infer a MA which does not compute a stochastic language, with no means to
detect it.

3 Identifying $\mathcal{S}_{PFA}$ in the Limit

We show in this Section that $\mathcal{S}_{PFA}$ is identifiable in the limit
with probability one. Moreover, the identification is strong when the target can
be generated by a PFA whose parameters are rational numbers.

3.1 Weak Identification

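Let $P$ be a stochastic language over $\Sigma$, let
$\mathcal{A} = (A_i)_{i \in I}$ be a family of subsets of $\Sigma^*$, let $S$ be
a finite sample drawn according to $P$, and let $P_S$ be the empirical
distribution associated with $S$. It can be shown [19,20] that for any
confidence parameter $\delta$, with a probability greater than $1 - \delta$, for
any $i \in I$,

$$|P_S(A_i) - P(A_i)| \leq c \sqrt{\frac{VC(\mathcal{A}) - \log\frac{\delta}{4}}{\mathrm{card}(S)}} \qquad (1)$$

where $VC(\mathcal{A})$ is the Vapnik-Chervonenkis dimension of $\mathcal{A}$
and $c$ is a universal constant. When $\mathcal{A} = (\{w\})_{w \in \Sigma^*}$,
$VC(\mathcal{A}) = 1$. Let
$\Psi(\epsilon, \delta) = \frac{c^2}{\epsilon^2}(1 - \log\frac{\delta}{4})$.

Lemma 1. Let $P \in \mathcal{S}$ and let $S$ be a complete presentation of $P$.
For any precision parameter $\epsilon$, any confidence parameter $\delta$, any
$n \geq \Psi(\epsilon, \delta)$, with a probability greater than $1 - \delta$,
$|P_n(w) - P(w)| \leq \epsilon$ for all $w \in \Sigma^*$.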

Proof. Use Inequality (1). □ 

For any integer $k$, let $Q_k = \{1, \ldots, k\}$ and let
$\Theta_k = \{\iota_i, \tau_i, \varphi^x_{i,j} \mid i,j \in Q_k, x \in \Sigma\}$
be a set of variables. We consider the following set of constraints $C_k$ on
$\Theta_k$:

$$C_k = \begin{cases}
0 \leq \iota_i, \tau_i, \varphi^x_{i,j} \leq 1 & \text{for any } i,j \in Q_k, x \in \Sigma, \\
\sum_{i \in Q_k} \iota_i \leq 1, \\
\tau_i + \sum_{j \in Q_k, x \in \Sigma} \varphi^x_{i,j} \leq 1 & \text{for any } i \in Q_k.
\end{cases}$$

Any assignment $\theta$ of these variables satisfying $C_k$ is said to be valid;
any valid assignment $\theta$ defines a semi-PFA $A^\theta_k$ by letting
$\iota(i) = \iota_i$, $\tau(i) = \tau_i$ and $\varphi(i, x, j) = \varphi^x_{i,j}$
for any states $i$ and $j$ and any letter $x$. We simply denote by $P_\theta$
the function $P_{A^\theta_k}$ associated with $A^\theta_k$. Let $V_k$ be the set
of valid assignments. For any $\theta \in V_k$, let $\theta^t$ be the associated
trimmed assignment, which sets to 0 every parameter which is never effectively
used to compute the probability $P_\theta(w)$ of some word $w$. Clearly,
$\theta^t$ is valid and $P_\theta = P_{\theta^t}$.

For any $w$, $P_\theta(w)$ is a polynomial and is therefore a continuous
function of $\theta$. On the other hand, the series $\sum_w P_\theta(w)$ are
convergent but not uniformly convergent and $P_\theta(w\Sigma^*)$ is not a
continuous function of $\theta$ (see Fig. 1). However, we show below that the
function $(\theta, w) \to P_\theta(w)$ is uniformly continuous.









Fig. 1. $P_{\theta_\alpha}(\varepsilon) = 1/4 + \alpha/2$; $P_{\theta_0}(\Sigma^*) = 1/2$ and $P_{\theta_\alpha}(\Sigma^*) = 1$ when $\alpha > 0$.



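Proposition 1. For any $k \in \mathbb{N}$, the function
$(\theta, w) \to P_\theta(w)$ is uniformly continuous:
$\forall \epsilon, \exists \alpha, \forall w \in \Sigma^*, \forall \theta, \theta' \in V_k$,
$\|\theta - \theta'\| < \alpha \Rightarrow |P_\theta(w) - P_{\theta'}(w)| < \epsilon$.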
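Proof. We prove the proposition in several steps.

1. Let $\theta_0 \in V_k$, let
$A^{\theta_0^t}_k = \langle \Sigma, Q_k, \varphi_0, \iota_0, \tau_0 \rangle$ and
let $\beta_0 = \max\{\varphi_0(q, \Sigma^k, Q_k) \mid q \in Q_k\}$. For any
state $q$ s.t. $\varphi_0(q, \Sigma^k, Q_k) > 0$, there must exist a word $w$ of
length $< k$ and a state $q'$ s.t. $\varphi_0(q, w, q') \neq 0$ and
$\tau_0(q') \neq 0$. Hence, $\beta_0 < 1$.

2. For any integer $n$ and any state $q$,
$\varphi_0(q, \Sigma^{nk}, Q_k) \leq \beta_0^n$. Proof by induction on $n$:
clearly true when $n = 0$ and

$$\varphi_0(q, \Sigma^{nk+k}, Q_k) \leq \sum_{q' \in Q_k, w \in \Sigma^{nk}} \varphi_0(q, w, q')\varphi_0(q', \Sigma^k, Q_k) \leq \beta_0 \sum_{q' \in Q_k, w \in \Sigma^{nk}} \varphi_0(q, w, q') \leq \beta_0^{n+1}.$$

3. For any integer $n$,
$P_{\theta_0^t}(\Sigma^{nk}\Sigma^*) = \sum_{q \in Q_k} \iota_0(q)\varphi_0(q, \Sigma^{nk}, Q_k) \leq \beta_0^n$.

4. For any state $q$,

$$\varphi_0(q, \Sigma^*, Q_k) = \sum_{n \in \mathbb{N}} \sum_{m=0}^{k-1} \varphi_0(q, \Sigma^{nk+m}, Q_k) \leq \sum_{n \in \mathbb{N}} \sum_{m=0}^{k-1} \sum_{q' \in Q_k} \varphi_0(q, \Sigma^m, q')\varphi_0(q', \Sigma^{nk}, Q_k) \leq \sum_{n \in \mathbb{N}} \sum_{m=0}^{k-1} \sum_{q' \in Q_k} \beta_0^n \varphi_0(q, \Sigma^m, q') \leq k/(1 - \beta_0).$$

5. Let $\alpha_0$ be the minimal non null parameter in $\theta_0^t$, let
$\alpha < \alpha_0/2$, let $\theta$ be a valid assignment such that
$\|\theta - \theta_0\| < \alpha$ and let
$A^{\theta^t}_k = \langle \Sigma, Q_k, \varphi, \iota, \tau \rangle$. Note that
any non null parameter in $\theta_0^t$ corresponds to a non null parameter in
$\theta^t$ but that the converse is false (see Fig. 1). Let $\theta'$ be the
assignment obtained from $\theta^t$ by setting to 0 every parameter which is
null in $\theta_0^t$, let
$A^{\theta'}_k = \langle \Sigma, Q_k, \varphi', \iota', \tau' \rangle$ and let
$\beta' = \max\{\varphi'(q, \Sigma^k, Q_k) \mid q \in Q_k\}$. As $\theta'$ and
$\theta_0^t$ have the same set of non null parameters, there exists
$\alpha_1 < \alpha_0/2$ such that $\|\theta - \theta_0\| < \alpha_1$ implies
$\beta' < (1 + \beta_0)/2$. Let $\beta_1 = (1 + \beta_0)/2$.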

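6. Let $w$ be a word of length $\geq nk$. There are two categories of
derivations of $w$ in $A^{\theta^t}_k$:

- those which exist in $A^{\theta'}_k$. Their contribution to
$P_{\theta^t}(w)$ is not greater than $\beta_1^n$.

- those which do not entirely exist in $A^{\theta'}_k$ and one parameter of
which is $< \alpha_1$. Let $q_0, \ldots, q_{|w|}$ be such a derivation. Either
$\iota(q_0) < \alpha_1$, or $\tau(q_{|w|}) < \alpha_1$, or there exists a first
state $q_i$ such that $q_0, \ldots, q_i$ is a derivation in $A^{\theta'}_k$ and
$\varphi(q_i, w_i, q_{i+1}) < \alpha_1$, where $w_i$ is the $i$th letter of
$w$. The contribution of these derivations to $P_{\theta^t}(w)$ is bounded by

$$\sum_{q,\ \iota(q) < \alpha_1} \alpha_1 \varphi(q, w, Q) + \sum_{q, q'} \iota(q)\varphi(q, w, q')\alpha_1 + \sum_{q_0, q_i \in Q_k} \iota'(q_0)\varphi'(q_0, \Sigma^*, q_i)\alpha_1 \leq \alpha_1 (k + 1 + k/(1 - \beta_1)).$$

Therefore, $P_{\theta^t}(w) \leq \beta_1^n + \alpha_1 (k + 1 + k/(1 - \beta_1))$.

7. Let $\epsilon > 0$. Let
$\alpha_2 = \min(\alpha_1, \epsilon/[4(k + 1 + k/(1 - \beta_1))])$ and let $N$
be such that $\beta_1^N < \epsilon/4$. As for any fixed $w$, $P_\theta(w)$ is
continuous, there exists $\alpha \leq \alpha_2$ such that
$\|\theta - \theta_0\| < \alpha$ implies that for any word $w$ of length
$< Nk$, $|P_{\theta_0}(w) - P_\theta(w)| < \epsilon$. As
$P_{\theta_0}(w) \leq \epsilon/2$ and $P_\theta(w) \leq \epsilon/2$ when
$|w| \geq Nk$, we conclude that for all words $w$,
$|P_{\theta_0}(w) - P_\theta(w)| < \epsilon$.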
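8. We have shown that:
$\forall \epsilon, \forall \theta_0 \in V_k, \exists \alpha, \forall w \in \Sigma^*, \forall \theta \in V_k$,
$\|\theta - \theta_0\| < \alpha \Rightarrow |P_\theta(w) - P_{\theta_0}(w)| < \epsilon$.
Now, suppose that:

$$\exists \epsilon, \forall n \in \mathbb{N}, \exists w_n \in \Sigma^*, \exists \theta_n, \theta'_n \in V_k \text{ s.t. } \|\theta_n - \theta'_n\| < 1/n \text{ and } |P_{\theta_n}(w_n) - P_{\theta'_n}(w_n)| > \epsilon.$$

As valid assignments are elements of a compact set, there would exist a valid
assignment $\theta_0$ such that $\theta_{\sigma(n)} \to \theta_0$ and
$\theta'_{\sigma(n)} \to \theta_0$ (for some subsequence $\sigma(n)$). We know
that there exists $\alpha > 0$ such that $\|\theta - \theta_0\| < \alpha$
implies that for all $w$, $|P_{\theta_0}(w) - P_\theta(w)| < \epsilon/2$. When
$1/n < \alpha$, the hypothesis leads to a contradiction. □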

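Let $P \in \mathcal{S}$ and let $S$ be a complete presentation of $P$. For any
integers $n$ and $k$ and for any $\epsilon > 0$, let
$I_{\Theta_k}(S_n, \epsilon)$ be the following system:

$$I_{\Theta_k}(S_n, \epsilon) = C_k \cup \{|P_\theta(w) - P_n(w)| \leq \epsilon \text{ for } w \in S_n\}.$$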



Lemma 2. Let $P \in \mathcal{S}$ be a stochastic language and let $S$ be a
complete presentation of $P$. Suppose that there exists an integer $k$ and a PFA
$A^{\theta_0}_k$ such that $P = P_{\theta_0}$. Then, for any precision parameter
$\epsilon$, any confidence parameter $\delta$ and any
$n \geq \Psi(\epsilon/2, \delta)$, with a probability greater than
$1 - \delta$, $I_{\Theta_k}(S_n, \epsilon)$ has a solution that can be computed.

Proof. From Lemma 1, with a probability greater than $1 - \delta$, we have
$|P_{\theta_0}(w) - P_n(w)| \leq \epsilon/2$ for all $w \in S_n$. For any
$w \in S_n$, $P_\theta(w)$ is a polynomial in $\theta$ whose coefficients are
all equal to 1. A bound $M_w$ on $\|dP_\theta(w)/d\theta\|$ can easily be
computed. We have

$$|P_\theta(w) - P_{\theta'}(w)| \leq M_w \|\theta - \theta'\|.$$

Let $\alpha = \inf\{\epsilon/(2M_w) \mid w \in S_n\}$. If
$\|\theta - \theta'\| < \alpha$, then
$|P_\theta(w) - P_{\theta'}(w)| \leq \epsilon/2$ for all $w \in S_n$. So, we can
compute a finite number of assignments
$\theta^\alpha_1, \ldots, \theta^\alpha_{N_\alpha}$ such that for every valid
assignment $\theta$, there exists $1 \leq i \leq N_\alpha$ such that
$\|\theta - \theta^\alpha_i\| < \alpha$. Let $i$ be such that
$\|\theta_0 - \theta^\alpha_i\| < \alpha$: $\theta^\alpha_i$ is a solution of
$I_{\Theta_k}(S_n, \epsilon)$. □

The Borel-Cantelli Lemma is often used to show that a given property holds with
probability 1: let $(A_n)_{n \in \mathbb{N}}$ be a sequence of events such that
$\sum_{n \in \mathbb{N}} Pr(A_n) < \infty$; then, the probability that a finite
number of $A_n$ occur is 1.

For any integer $n$, let $\epsilon_n = n^{-1/3}$ and $\delta_n = n^{-2}$.
Clearly, $\epsilon_n \to 0$ and $\sum_{n \in \mathbb{N}} \delta_n < \infty$.
Moreover, there exists an integer $N$ s.t. $\forall n > N$,
$n \geq \Psi(\epsilon_n/2, \delta_n)$.

Proposition 2. Let $P$ be a stochastic language and let $S$ be a complete
presentation of $P$. Suppose that there exists an integer $k$ and a PFA
$A^{\theta_0}_k$ such that $P = P_{\theta_0}$. With probability 1 there exists
an integer $N$ such that for any $n > N$, $I_{\Theta_k}(S_n, \epsilon_n)$ has a
solution $\theta_n$ and $\lim_{n \to \infty} P_{\theta_n}(w) = P(w)$ uniformly
in $w$.

Proof. The Borel-Cantelli Lemma proves that with probability 1 there exists an
integer $N$ s.t. for any $n > N$, $I_{\Theta_k}(S_n, \epsilon_n)$ has a solution
$\theta_n$. Now suppose that

$$\exists \epsilon, \forall N, \exists n \geq N, \exists w_n \in \Sigma^*, |P_{\theta_n}(w_n) - P(w_n)| \geq \epsilon.$$

Let $(\theta_{\sigma(n)})$ be a subsequence of $(\theta_n)$ such that for every
integer $n$, $\sigma(n) \geq n$,
$|P_{\theta_{\sigma(n)}}(w_{\sigma(n)}) - P(w_{\sigma(n)})| \geq \epsilon$ and
$\theta_{\sigma(n)} \to \theta$. As each $\theta_{\sigma(n)}$ is a solution of
$I_{\Theta_k}(S_{\sigma(n)}, \epsilon_{\sigma(n)})$, $\theta$ is a valid
assignment such that for all $w$ such that $P(w) \neq 0$, $P(w) = P_\theta(w)$.
As $P$ is a stochastic language, we must have $P(w) = P_\theta(w)$ for every
word $w$, i.e. $P = P_\theta$. From Proposition 1, $P_{\theta_{\sigma(n)}}$
converges uniformly to $P$, which contradicts the hypothesis. □

It remains to show that when the target cannot be expressed by a PFA on $k$
states, the system $I_{\Theta_k}(S_n, \epsilon_n)$ has no solution from some
index.

Proposition 3. Let $P$ be a stochastic language and let $S$ be a complete
presentation of $P$. Let $k$ be an integer such that there exists no $\theta$
satisfying $P = P_\theta$. Then, with probability 1, there exists an integer $N$
such that for any $n > N$, $I_{\Theta_k}(S_n, \epsilon_n)$ has no solution.

Proof. Suppose that $\forall N \in \mathbb{N}, \exists n \geq N$ such that
$I_{\Theta_k}(S_n, \epsilon_n)$ has a solution. Let $(n_i)_{i \in \mathbb{N}}$
be an increasing sequence such that $I_{\Theta_k}(S_{n_i}, \epsilon_{n_i})$ has
a solution $\theta_i$ and let $(\theta_{k_i})$ be a subsequence of $(\theta_i)$
that converges to a limit value $\theta$.

Let $w \in \Sigma^*$ be such that $P(w) \neq 0$. We have
$|P_\theta(w) - P(w)| \leq |P_\theta(w) - P_{\theta_i}(w)| + |P_{\theta_i}(w) - P_{n_i}(w)| + |P_{n_i}(w) - P(w)|$
for any integer $i$.

With probability 1, the last term converges to 0 as $i$ tends to infinity (Lemma
1). With probability 1, there exists an index $i$ such that $w \in S_{n_i}$.
From this index, the second term is less than $\epsilon_{n_i}$, which tends to 0
as $i$ tends to infinity. Now, as $P_\theta(w)$ is a continuous function of
$\theta$, the first term tends to 0 as $i$ tends to infinity. Therefore,
$P_\theta(w) = P(w)$ and $P_\theta = P$, which contradicts the hypothesis. □



Theorem 1. $\mathcal{S}_{PFA}$ is identifiable in the limit with probability one.

Proof. Consider the following algorithm $\mathcal{A}$:

Input: A stochastic sample $S_n$ of length $n$.
for $k = 1$ to $n$ do {
    compute $\alpha$ and $\theta^\alpha_1, \ldots, \theta^\alpha_{N_\alpha}$ as in Lemma 2
    if $\exists\, 1 \leq i \leq N_\alpha$ s.t. $\theta^\alpha_i$ is a solution of $I_{\Theta_k}(S_n, \epsilon_n)$ then
        {return the smallest solution (in some order)}}
return a default hypothesis if no solution has been found

Let $P$ be the target and let $A^{\theta_0}_k$ be a minimal state PFA which
computes $P$. Previous propositions prove that with probability one, from some
index $N$, the algorithm shall output a PFA $A^{\theta_n}_k$ such that
$P_{\theta_n}$ converges uniformly to $P$. □
3.2 Strong Identification

When the target can be computed by a PFA whose parameters are in $\mathbb{Q}$,
an equivalent PFA can be identified in the limit with probability 1. In order to
show a similar property for PDFA, a method based on Stern-Brocot trees was used
in [10]. Here we use the representation of real numbers by continued fractions
[21].

Let $x \geq 0$. Define $x_0 = x$, $a_0 = \lfloor x_0 \rfloor$ and while
$x_n \neq a_n$, $x_{n+1} = 1/(x_n - a_n)$ and
$a_{n+1} = \lfloor x_{n+1} \rfloor$. The sequences $(x_n)$ and $(a_n)$ are
finite iff $x \in \mathbb{Q}$. Suppose from now on that $x \in \mathbb{Q}$, let
$N$ be the last index for which $x_N$ and $a_N$ are defined (so that
$x_N = a_N$), and for any $n \leq N$, let the $n$th convergent of $x$ be the
fraction

$$p_n/q_n = a_0 + 1/(a_1 + 1/(\cdots (1/a_n) \cdots))$$

where $\gcd(p_n, q_n) = 1$.

Lemma 3 ([21]). We have $x = p_N/q_N$ and $\forall n < N$,
$|x - p_n/q_n| \leq \frac{1}{q_n q_{n+1}} < \frac{1}{q_n^2}$. If $a$ and $b$ are
two integers such that $|a/b - x| < \frac{1}{2b^2}$, then there is an integer
$n \leq N$ such that $a/b = p_n/q_n$. For any integer $A$, there exists only a
finite number of rational numbers $p/q$ such that $|x - p/q| \leq A/q^2$.

Let $x = 5/14$. We have $p_0/q_0 = 0$, $p_1/q_1 = 1/2$, $p_2/q_2 = 1/3$ and
$p_3/q_3 = x$.
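The continued fraction expansion and its convergents can be computed with exact rationals. The sketch below follows the iteration $x_{n+1} = 1/(x_n - a_n)$, $a_n = \lfloor x_n \rfloor$ and uses the classical recurrence $p_n = a_n p_{n-1} + p_{n-2}$, $q_n = a_n q_{n-1} + q_{n-2}$; the recurrence is standard but is not stated in the text above, and the function name `convergents` is ours.

```python
from fractions import Fraction
from math import floor

def convergents(x):
    """Return the convergents p_n/q_n of a non negative rational x."""
    xn = Fraction(x)
    quotients = []              # the partial quotients a_0, a_1, ...
    while True:
        an = floor(xn)
        quotients.append(an)
        if xn == an:            # the expansion of a rational stops here
            break
        xn = 1 / (xn - an)

    result = []
    p_prev, q_prev = 1, 0       # conventional values p_{-1}, q_{-1}
    p_prev2, q_prev2 = 0, 1     # conventional values p_{-2}, q_{-2}
    for an in quotients:
        p = an * p_prev + p_prev2
        q = an * q_prev + q_prev2
        result.append(Fraction(p, q))
        p_prev2, q_prev2 = p_prev, q_prev
        p_prev, q_prev = p, q
    return result

cs = convergents(Fraction(5, 14))
```

For $x = 5/14$ this reproduces the four convergents listed in the example, and each convergent before the last satisfies $|x - p_n/q_n| < 1/q_n^2$.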

Lemma 4. Let $(\epsilon_n)$ be a sequence of non negative real numbers which
converges to 0, let $x \in \mathbb{Q}$, let $(y_n)$ be a sequence of elements of
$\mathbb{Q}$ such that $|x - y_n| \leq \epsilon_n$ for all but finitely many
$n$. Let $p^n_m/q^n_m$ be the convergents associated with $y_n$. Then,
there exists an integer $N$ such that, for any $n \geq N$, there is an integer $m$ such
that X = ^ . Moreover, ^ is the unique rational number such that y-n — ^ + 

■ 

Proof. Omitted. All proofs omitted here can be found in a complete version of
the paper available at http://www.cmi.univ-mrs.fr/~fdenis .

Example 1. Let $y_n = 1/2 - 1/n$ and $\epsilon_n = 1/n$. Then $y_3 = 1/6$, $y_4 = 1/4$,
y5 = 3/10, 7/6 = 1/3, V 7 = 5/14. The first n s.t. Vn ~ ^ ^ F ^ 
a solution is $n = 4$. Let $z_n$ be the first solution. We have $z_4 = 1/4$,
$z_5 = 1/3$, $z_6 = 1/3$ and $z_n = 1/2$ for $n \geq 7$.

Theorem 2. Let $\mathcal{S}_{PFA}[\mathbb{Q}]$ be the set of stochastic
languages that can be generated from a PFA whose parameters are in $\mathbb{Q}$.
$\mathcal{S}_{PFA}[\mathbb{Q}]$ is strongly identifiable in the limit with
probability one.



Proof. Omitted. 
4 $\mathcal{S}_{MA}$ and $\mathcal{S}_{PFA}$

The representation of stochastic languages by MA is not robust. Fig. 2 shows
two MA which depend on a parameter $x$. They define a stochastic language when
$x = 0$ but not when $x > 0$. When $x > 0$, the first one generates negative
values, and the second one generates unbounded values.

Let $P \in \mathcal{S}_{MA}$ and let $A$ be the MA which generates $P$ output by
the exact learning algorithm defined in [12]. A sample $S$ drawn according to
$P$ defines an empirical distribution $P_S$ that could be used by some variant
of this learning algorithm. In the best case, this variant is expected to output
a hypothesis $\hat{A}$ having the same support as $A$ and with approximated
parameters close to those of $A$. But there is no guarantee that $\hat{A}$
defines a stochastic language. More seriously, we show below that it is
impossible to decide whether a given MA generates a stochastic language. The
conclusion is that the MA representation may not be appropriate for learning
stochastic languages.




Fig. 2. Two MA generating stochastic language if x = 0. If x > 0, the first generates 
negative values and the second unbounded values. 



4.1 Membership to $\mathcal{S}_{MA}$ Is Undecidable

We reduce the decision problem to a problem about acceptor PFA. An MA
$\langle \Sigma, Q, \varphi, \iota, \tau \rangle$ is an acceptor PFA if
$\varphi$, $\iota$ and $\tau$ are non negative functions,
$\sum_{q \in Q} \iota(q) = 1$, $\forall q \in Q, \forall x \in \Sigma$,
$\sum_{r \in Q} \varphi(q, x, r) = 1$ and if there exists a unique terminal
state $t$ such that $\tau(t) = 1$.

Theorem 3 ([22]). Given an acceptor PFA $A$ whose parameters are in $\mathbb{Q}$
and $\lambda \in \mathbb{Q}$, it is undecidable whether there exists a word $w$
such that $P_A(w) < \lambda$.

The following lemma shows some constructions on MA.

Lemma 5. Let $A$ and $B$ be two MA and let $\lambda \in \mathbb{Q}$. We can
construct:

1. a MA $I_\lambda$ such that $\forall w \in \Sigma^*$, $P_{I_\lambda}(w) = \lambda$,

2. a MA $A + B$ such that $P_{A+B} = P_A + P_B$,

3. a MA $\lambda \cdot A$ such that $P_{\lambda \cdot A} = \lambda P_A$,

4. a MA $tr(A)$ such that for any word $w$, $P_{tr(A)}(w) = \frac{P_A(w)}{(|\Sigma|+1)^{|w|+1}}$.




Fig. 3. How to construct $I_\lambda$, $A + B$, $\lambda \cdot A$ and $tr(A)$, where $n = |\Sigma| + 1$.
Note that when $A$ is an acceptor PFA, $tr(A)$ is a semi-PFA.



Proof. Proofs are omitted. See Fig. 3. 

Lemma 6. Let $A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$ be a
semi-PFA, let
$Q^t = \{q \in Q \mid \varphi(Q_I, \Sigma^*, q) > 0 \text{ and } \varphi(q, \Sigma^*, Q_T) > 0\}$
and let
$A^t = \langle \Sigma, Q^t, \varphi|_{Q^t}, \iota|_{Q^t}, \tau|_{Q^t} \rangle$.
Then, $A^t$ is a trimmed semi-PFA such that $P_A = P_{A^t}$ and which can be
constructed from $A$.

Proof. Straightforward.

Lemma 7. Let $A$ be a trimmed semi-PFA; we can compute $P_A(\Sigma^*)$.

Proof. Omitted.

Proposition 4. It is undecidable whether a MA generates a stochastic language.

Proof. Let $A$ be an acceptor PFA on $\Sigma$ and $\lambda \in \mathbb{Q}$. For
every word $w$, we have

$$P_{tr(A - I_\lambda)}(w) = (|\Sigma| + 1)^{-(|w|+1)} (P_A(w) - \lambda) = P_{tr(A)}(w) - \lambda (|\Sigma| + 1)^{-(|w|+1)}$$

and therefore $P_{tr(A - I_\lambda)}(\Sigma^*) = P_{tr(A)}(\Sigma^*) - \lambda$.

- If $P_{tr(A)}(\Sigma^*) = \lambda$ then either $\exists w$ s.t.
$P_A(w) < \lambda$ or $\forall w, P_A(w) = \lambda$. Let $B$ be the PFA such
that $P_B(w) = 1$ if $w = \varepsilon$ and 0 otherwise. We have
$P_{B + tr(A - I_\lambda)}(\Sigma^*) = 1$. Therefore,
$\forall w, P_A(w) \geq \lambda$ iff $P_A(\varepsilon) \geq \lambda$ and
$B + tr(A - I_\lambda)$ generates a stochastic language.

- If $P_{tr(A)}(\Sigma^*) \neq \lambda$, let
$B = |P_{tr(A)}(\Sigma^*) - \lambda|^{-1} \cdot tr(A - I_\lambda)$. Check that
$B$ is computable from $A$, that $P_B(\Sigma^*) = 1$ and that
$P_B(w) = |P_{tr(A)}(\Sigma^*) - \lambda|^{-1} (|\Sigma| + 1)^{-(|w|+1)} (P_A(w) - \lambda)$.
So, $\exists w \in \Sigma^*, P_A(w) < \lambda$ iff $B$ does not generate a
stochastic language.

In both cases, we see that deciding whether a MA generates a stochastic language
would solve the decision problem on PFA acceptors. □
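The step from the pointwise identity to the identity on $\Sigma^*$ at the start of the proof relies on a geometric series. As a short check (our own verification, not part of the original proof), grouping words by length and using the fact that there are $|\Sigma|^n$ words of length $n$:

```latex
\sum_{w \in \Sigma^*} (|\Sigma|+1)^{-(|w|+1)}
  = \sum_{n \geq 0} \frac{|\Sigma|^n}{(|\Sigma|+1)^{n+1}}
  = \frac{1}{|\Sigma|+1} \cdot \frac{1}{1 - \frac{|\Sigma|}{|\Sigma|+1}}
  = 1 .
```

Hence summing $\lambda (|\Sigma|+1)^{-(|w|+1)}$ over all words gives exactly $\lambda$, which is the amount subtracted from $P_{tr(A)}(\Sigma^*)$.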

Remark that in fact, we have proved a stronger result: it is undecidable whether
a MA $A$ such that $\sum_{w \in \Sigma^*} P_A(w) = 1$ generates a stochastic
language. As a consequence, it can be proved that there exist stochastic
languages that can be computed by MA but not by PFA.

Theorem 4. $\mathcal{S}_{PFA} \subsetneq \mathcal{S}_{MA}$.

Proof. Omitted. 
5 Learning PRFA

The inference algorithm given in Section 3 is highly inefficient and cannot be 
used for real applications. It is unknown whether PFA can be efficiently learned. 
Here, we study a subclass of PFA, for which there exists a learning algorithm 
which can be efficiently implemented. 

5.1 Probabilistic Residual Finite Automata 

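Definition 2 (Probabilistic Residual Finite Automaton). A PRFA is a PFA
$A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$ whose states define
residual languages of $P_A$, i.e. such that
$\forall q \in Q, \exists u \in \Sigma^*$, $P_{A,q} = u^{-1}P_A$, where
$P_{A,q}$ denotes the stochastic language generated by
$\langle \Sigma, Q, \varphi, \iota_q, \tau \rangle$ where $\iota_q(q) = 1$ [16].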

Remark that PDFA are PRFA but that the converse is false. Fig. 4 represents 
a PRFA where E = {a, 6}, Q = {s,a,b}, l{s) = 1, r(6) = |, 

ip{s,a,a) = (p{e,b,b) = (f{a,a,a) = (p{a,a,b) = | and (f{b,a,b) = 




Fig. 4. A prefix PRFA. 



Let $\mathcal{P}$ be a finite subset of $\mathcal{S}$. The convex closure of
$\mathcal{P}$ is denoted by
$\mathrm{conv}(\mathcal{P}) = \{P \in \mathcal{S} \mid \exists P_1, \ldots, P_n \in \mathcal{P}, \exists \lambda_1, \ldots, \lambda_n \geq 0, P = \sum_{i=1}^n \lambda_i P_i\}$.
We say that $\mathcal{P}$ is a residual net if for every $Q \in \mathcal{P}$ and
every $u \in \mathrm{res}(Q)$, $u^{-1}Q \in \mathrm{conv}(\mathcal{P})$. A
residual net $\mathcal{P}$ is a convex generator for $P \in \mathcal{S}$ if
$P \in \mathrm{conv}(\mathcal{P})$.

It can be shown that
$\mathcal{S}_{PDFA} \subsetneq \mathcal{S}_{PRFA} \subsetneq \mathcal{S}_{PFA} \subsetneq \mathcal{S}_{MA} \subsetneq \mathcal{S}$
[16]. More precisely, let $P \in \mathcal{S}$:

- $P \in \mathcal{S}_{PDFA}$ iff $P$ has a finite number of residual languages.

- $P \in \mathcal{S}_{PRFA}$ iff there exists a convex generator for $P$
composed of residual languages of $P$.

- $P \in \mathcal{S}_{PFA}$ iff there exists a convex generator for $P$.

- $P \in \mathcal{S}_{MA}$ iff $\mathrm{Res}(P)$ spans a finite dimensional
vector space.

Any $P \in \mathcal{S}_{PDFA}$ can be generated by a minimal (in number of
states) PDFA whose states correspond to the residual languages of $P$. In a
similar way, it can be shown that any $P \in \mathcal{S}_{PRFA}$ has a unique
minimal convex generator, composed of prime residual languages of $P$ which
correspond to the states of a minimal (in number of states) PRFA generating $P$
(see [17] for a complete study). Such a canonical form does not exist for PFA or
MA.
A PRFA $A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$ is prefix if $Q$ is
a prefixial subset of $\Sigma^*$, $\iota(\varepsilon) = 1$, and
$\forall (u, x, v) \in Q \times \Sigma \times Q$, $\varphi(u, x, v) \neq 0$
implies $ux = v$ or $ux \notin Q$. Transitions of the form $(u, x, ux)$ are
called internal transitions; the others are called return transitions. For
example, the automaton of Fig. 4, which can be built on the set
$\{\varepsilon, a, b\}$, is a prefix PRFA: the transitions $(\varepsilon, a, a)$
and $(\varepsilon, b, b)$ are internal while $(a, a, a)$, $(a, a, b)$ and
$(b, a, b)$ are return transitions. Prefix PRFA are sufficient to generate all
languages in $\mathcal{S}_{PRFA}$.

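Let $P \in \mathcal{S}_{PRFA}$. $Pm(P)$ is the smallest prefixial subset of
$\Sigma^*$ such that $\forall u \in Pm(P)$,
$\forall x \in \Sigma \cap \mathrm{res}(u^{-1}P)$,

$$(ux)^{-1}P \in \mathrm{conv}(\{v^{-1}P \mid v \in Pm(P),\ v < ux\}) \iff ux \notin Pm(P).$$

Let $U_{ux} = \{v \in Pm(P) \mid v < ux\}$ and for any word $u \in Pm(P)$ and
any $x \in \Sigma$, let $(\alpha^{u,x}_v)_{v \in U_{ux}}$ be positive parameters
such that $(ux)^{-1}P = \sum_{v \in U_{ux}} \alpha^{u,x}_v v^{-1}P$. Consider
now the following PFA $A_P = \langle \Sigma, Pm(P), \varphi, \iota, \tau \rangle$
where $\iota(\varepsilon) = 1$,
$\varphi(u, x, v) = P(ux\Sigma^*)/P(u\Sigma^*)$ if $v = ux$ and
$\varphi(u, x, v) = \alpha^{u,x}_v P(ux\Sigma^*)/P(u\Sigma^*)$ if
$(ux)^{-1}P = \sum_{v \in U_{ux}} \alpha^{u,x}_v v^{-1}P$. It can be proved that
$A_P$ is a prefix PRFA which generates $P$ [16]. See Fig. 4 for an example,
where $Pm(P) = \{\varepsilon, a, b\}$.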

5.2 The Inference Algorithm 

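For any finite prefixial set $Q$, let
$\Theta_Q = \{\iota_u, \tau_u, \varphi^x_{u,v} \mid u, v \in Q, x \in \Sigma\}$
be a set of variables. We consider the following set of constraints $C_Q$ on
$\Theta_Q$:

$$C_Q = \begin{cases}
0 \leq \iota_u, \tau_u, \varphi^x_{u,v} \leq 1 & \text{for any } u, v \in Q, x \in \Sigma, \\
\iota_\varepsilon = 1, \\
\iota_u = 0 & \text{for any } u \in Q \setminus \{\varepsilon\}, \\
\tau_u + \sum_{v \in Q, x \in \Sigma} \varphi^x_{u,v} = 1 & \text{for any } u \in Q, \\
\varphi^x_{u,v} = 0 & \text{for any } u, v, x \text{ s.t. } ux \neq v \text{ and } ux \in Q.
\end{cases} \qquad (2)$$

Any assignment $\theta$ of these variables satisfying $C_Q$ defines a prefix
PRFA $A^\theta_Q$.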

Let P G S, let S' be a complete presentation of P, for any finite prefix- 
ial set Q, any e > 0, any integer n and any v G res (P) such that Vm G Q, 
V > u, let Iqq (w, S„, e) be the following system: (u, S„, e) = Cq U Cntemai U 

Creturn(^) where Cinternal — {-Cie (u:) = Pn(w^\ W G Q} and Creturn('Cx) (x G A) 

X 

is the set of constraints (ux)~^P„ {wS*) — X^msq p (wA*) <e 

for all w G pref (S„) successors of vx. Let l 0 Q{Sn,e) = Cq U 
Cinternal Ur;a:efr(Q,Pn) ^return(P^)- 

The constraint set $C_{internal}$ can be solved immediately and gives the
parameters of the internal part of the automaton. It can be solved with
$\iota(\varepsilon) = 1$,
$\forall (u, x, ux) \in Q \times \Sigma \times Q$,
$\varphi(u, x, ux) = P_n(ux\Sigma^*)/P_n(u\Sigma^*)$ and for all $u \in Q$,
$\tau(u) = P_n(u)/P_n(u\Sigma^*)$. $C_{return}$ is used to get the parameters of
return transitions. Remark that $I_{\Theta_Q}(S_n, \epsilon)$ is a system
composed of linear inequations.

Let DEES be the following algorithm:

Input: a stochastic sample $S_n$
Output: a prefix PRFA $A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$

$Q \leftarrow \{\varepsilon\}$, $F \leftarrow \Sigma \cap \mathrm{res}(P_n)$
while $F \neq \emptyset$ do {
    $v = \min F$, $F \leftarrow F \setminus \{v\}$
    if $I_{\Theta_Q}(v, S_n, \epsilon_n)$ has no solution then {
        $Q \leftarrow Q \cup \{v\}$, $F \leftarrow F \cup \{vx \in \mathrm{res}(P_n) \mid x \in \Sigma\}$}}
if $I_{\Theta_Q}(S_n, \epsilon_n)$ has some solution $A^\theta_Q$ then return $A^\theta_Q$,
else return the prefix tree automaton of $S_n$.

DEES identifies $\mathcal{S}_{PRFA}$ in the limit with probability 1.

Theorem 5. Let $P \in \mathcal{S}_{PRFA}$ and let $S$ be a complete presentation
of $P$; then with probability one, there exists $N \in \mathbb{N}$ such that for
any $n > N$, the set of states of DEES($S_n$) is $Pm(P)$ and
$P_{DEES(S_n)}$ converges to $P$.

Proof. It can be proved that, with probability one, after some rank,
$I_{\Theta_Q}(S_n, \epsilon_n)$ has solutions if and only if there exists a
prefix PRFA $A^\theta_Q$ such that $P_{A^\theta_Q} = P$. More precisely, it can
be shown that $Pm(P)$ is identified as the set of states from some index. The
proofs are similar to the proofs of Prop. 2 and Prop. 3. □

Example. Let the target be the prefix PRFA of Fig. 4. Let $S_{20}$ be the sample
such that $\mathrm{pref}(S_{20}) = \{(\varepsilon : 20), (a : 12), (b : 8),
(aa : 12), (ba : 2), (aaa : 11), (baa : 1), (aaaa : 4), (aaaaa : 3),
(aaaaaa : 2)\}$ where $(u : n)$ means that $n$ occurrences of $u$ are counted.
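The closed-form solution of the internal constraints given in Section 5.2, $\varphi(u, x, ux) = P_n(ux\Sigma^*)/P_n(u\Sigma^*)$ and $\tau(u) = P_n(u)/P_n(u\Sigma^*)$, can be evaluated directly on these prefix counts. The sketch below makes two assumptions: prefixes that are not listed have count 0, and the fourth entry of the sample is read as $(aa : 12)$. The function names are ours.

```python
from fractions import Fraction

# Prefix counts of the sample S_20; "" stands for the empty word.
PREF_COUNTS = {
    "": 20, "a": 12, "b": 8, "aa": 12, "ba": 2,
    "aaa": 11, "baa": 1, "aaaa": 4, "aaaaa": 3, "aaaaaa": 2,
}
ALPHABET = "ab"

def prefix_count(u):
    """Number of sample words having u as a prefix (0 if u is not listed)."""
    return PREF_COUNTS.get(u, 0)

def internal_transition(u, x):
    """Estimate phi(u, x, ux) = P_n(ux Sigma^*) / P_n(u Sigma^*)."""
    return Fraction(prefix_count(u + x), prefix_count(u))

def termination(u):
    """Estimate tau(u) = P_n(u) / P_n(u Sigma^*).

    The number of sample words equal to u is the prefix count of u minus
    the prefix counts of its one-letter extensions.
    """
    ended = prefix_count(u) - sum(prefix_count(u + x) for x in ALPHABET)
    return Fraction(ended, prefix_count(u))

phi_eps_a = internal_transition("", "a")
phi_eps_b = internal_transition("", "b")
tau_b = termination("b")
```

These values only cover internal transitions and termination weights; return transitions still require solving the linear system $C_{return}$.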




Fig. 5. DEES on $S_{20}$.



In the first step of the algorithm, $Q = \{\varepsilon\}$ (see Fig. 5.1).
$I_{\Theta_Q}(a, S_n, \epsilon)$ is the system:



a-ip„(A*) - p^£-ip„ (A*) 
a-ip„(aP*) - (aS*) 



< € 

<e^ 



I - in°- ^ . 1 
re.e 12 

12 _ a 20 

12 Fe,e 12 



12 

20 



< e 

< e 



which has no solution. Then we add the state $a$ to $Q$ (see Fig. 5.2). In the second
step, $Q = \{\varepsilon, a\}$ and $I_{\Theta_Q}(b, S_{20}, \epsilon)$ has no solution. Then $b$ is added to $Q$ (see
Fig. 5.3). In the third step, Q = {e, a,6} and as = 0, = 0,556 and 

— 0)444 is a solution of l 0 Q(aa, Sn,e), we construct the automaton with 
these values (see Fig. 5.4). In the last step, Q = {e, a, 6}, and = (/?^^ = 0, 
Fbb ~ 0,24 is a valid solution of 70^(60, An, e). The returned automaton is a 
prefix PRFA close to the target represented on Fig. 4. 
6 Conclusion

We have shown that PFA are identifiable in the limit with probability one, that 
representing stochastic languages using Multiplicity Automata presents some se- 
rious drawbacks and we have proposed a subclass of PFA, the class of PRFA, 
and a learning algorithm which identifies this class and which should be imple- 
mented efficiently. In the absence of models which could precisely measure the 
performances of learning algorithms of PFA, we plan to compare experimentally 
our algorithm to other learning algorithms used in this field. We predict that we 
shall have better performances than algorithms that infer PDFA, since PRFA is a 
much more expressive class, but this has to be experimentally established. The 
questions remain whether richer subclasses of PFA can be efficiently inferred, 
and what is the level of expressivity needed in practical learning situations. 
Abstract. This paper deals with two well discussed, but largely open
problems on E-pattern languages, also known as extended or erasing 
pattern languages: primarily, the learnability in Gold’s learning model 
and, secondarily, the decidability of the equivalence. As the main result, 
we show that the full class of E-pattern languages is not inferrable from 
positive data if the corresponding terminal alphabet consists of exactly 
three or of exactly four letters - an insight that remarkably contrasts 
with the recent positive finding on the learnability of the subclass of 
terminal-free E-pattern languages for these alphabets. As a side-effect of 
our reasoning thereon, we reveal some particular example patterns that 
disprove a conjecture of Ohlebusch and Ukkonen ( Theoretical Computer 
Science 186, 1997) on the decidability of the equivalence of E-pattern 
languages. 



1 Introduction 

In the context of this paper, a pattern - a finite string that consists of variables
and terminal symbols - is used as a device for the definition of a formal language.
A word of its language is generated by a uniform substitution of all variables with
arbitrary strings of terminal symbols. For instance, the language generated by the
pattern $\alpha = x_1 x_1 \mathtt{a} \mathtt{b} x_2$ (with $x_1$, $x_2$ as
variables and a, b as terminals) includes all words where the prefix can be
split into two occurrences of the same string, followed by the string ab and
concluded by an arbitrary suffix. Thus, the language of $\alpha$ contains, among
others, the words $w_1$ = a a a b a, $w_2$ = a b a b a b a b, $w_3$ = a b b b,
whereas the following examples are not covered by $\alpha$: $v_1$ = b a,
$v_2$ = b b b b b, $v_3$ = b a a b a. Consequently, numerous regular and
nonregular languages can be described by patterns in a compact and “natural” way.
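Membership of a word in the language of a pattern can be checked by a simple backtracking search over the possible substitutions. The sketch below is only an illustration of the definition (it is exponential in the worst case and is not an algorithm from the paper); the encoding of variables as strings starting with "x" and the `erasing` flag are assumptions made for this example.

```python
def matches(pattern, word, erasing=True, binding=None):
    """Decide whether `word` is generated by `pattern`.

    `pattern` is a list of symbols: variables are strings starting with "x"
    (for example "x1"), every other symbol is a terminal letter. With
    erasing=True variables may be replaced by the empty word (E-pattern
    languages); with erasing=False they may not (NE-pattern languages).
    """
    binding = {} if binding is None else binding
    if not pattern:
        return word == ""
    head, rest = pattern[0], pattern[1:]
    if not head.startswith("x"):                       # terminal symbol
        return word.startswith(head) and matches(rest, word[len(head):], erasing, binding)
    if head in binding:                                # variable seen before: same value
        value = binding[head]
        return word.startswith(value) and matches(rest, word[len(value):], erasing, binding)
    first = 0 if erasing else 1
    for i in range(first, len(word) + 1):              # try every prefix as the value
        if matches(rest, word[i:], erasing, {**binding, head: word[:i]}):
            return True
    return False

ALPHA = ["x1", "x1", "a", "b", "x2"]
in_language = [matches(ALPHA, w) for w in ("aaaba", "abababab", "abbb")]
not_in_language = [matches(ALPHA, v) for v in ("ba", "bbbbb", "baaba")]
```

With `erasing=False` the word a b b b is rejected, since it can only be obtained by substituting the empty word for $x_1$.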

The investigation of patterns in strings - initiated by Thue in [22] - may
be seen as a classical topic in the research on word monoids and combinatorics
of words (cf. [19]); the definition of pattern languages as described above goes
back to Angluin [1]. Pattern languages have been the subject of several analyses
within the scope of formal language theory, e.g. by Jiang, Kinber, Salomaa,
Salomaa, Yu ([7], [8]) - for a survey see [19] again. These examinations reveal
that a definition disallowing the substitution of variables with the empty word
- as given by Angluin - leads to a language with particular features being quite
different from the one allowing the empty substitution (that has been applied
when generating $w_3$ in our example). Languages of the latter type have been
introduced by Shinohara in [20]; contrary to those following Angluin’s
definition (called NE-pattern languages), they are referred to as extended,
erasing, or simply E-pattern languages.

Particularly for E-pattern languages, a number of fundamental properties are
still unresolved; one of the best-known open problems among these is the
decidability of the equivalence, i.e. the question on the existence of a total computable
function that, given any pair of patterns, decides whether or not they generate
the same language. This problem, which for NE-pattern languages has a trivial
answer in the affirmative, has been discussed several times (cf. [7], [8], [5], and
[12]), contributing a number of conjectures, conditions and positive results on
subclasses, but no comprehensive answer.

When dealing with pattern languages, manifold questions arise from the prob- 
lem of computing a pattern that is common to a given set of words. Therefore, 
pattern languages have been a focus of interest of algorithmic learning theory 
from the very beginning. In the elementary learning model of inductive inference
- known as learning in the limit or Gold style learning (introduced by Gold in
1967, cf. [6]) - a class of languages is said to be inferrable from positive data if and 
only if a computable device (the so-called learning strategy) - that reads growing 
initial segments of texts (an arbitrary stream of words that, in the limit, fully 
enumerates the language) - after finitely many steps converges for every lan- 
guage and for every corresponding text to a distinct output exactly representing 
the given language. In other words, the learning strategy is expected to extract 
a complete description of a (potentially infinite) language from finite data. Ac- 
cording to [6], this task is too challenging for many well-known classes of formal 
languages: All superfinite classes of languages - i.e. all classes that contain every 
finite and at least one infinite language - such as the regular, context-free and 
context-sensitive languages are not inferrable from positive data. Consequently, 
the number of rich classes of languages that are known to be learnable is rather 
small. Finally, it is worth mentioning that Gold’s model has been complemented 
by several criteria on language learning (e.g. in [2]) and, moreover, that it has 
been transformed into a widely analysed learning model for classes of recursive 
functions (cf., e.g., [4], for a survey see [3]). 

The current state of knowledge concerning the learnability of pattern lan- 
guages considerably differs when regarding NE- or E-pattern languages, respec- 
tively. The learnability of the class of NE-pattern languages was shown by An- 
gluin when introducing its definition in 1980 (cf. [1]). In the sequel there has 
been a variety of additional studies - e.g. in [9], [23], [17] and many more (for 
a survey see [21]) - concerning the complexity of learning algorithms, conse- 
quences of different input data, efficient strategies for subclasses, and so on. The 
question, however, whether or not the class of E-pattern languages is learnable
- considered to be "one of the outstanding open problems in inductive inference"
(cf. [11]) - remained unresolved for two decades, until it was answered in
[14] in a negative way for terminal alphabets with exactly two letters. Positive
results on subclasses have been presented in [20], [11], [13], and [15]. Moreover,
[11] proves the full class of E-pattern languages to be learnable for infinite and
unary alphabets as these alphabets significantly facilitate inferrability.

In the present paper we show that the class of E-pattern languages is not 
inferrable from positive data if the corresponding terminal alphabet consists of 
exactly three or of exactly four letters (cf. Section 3) . We consider this outcome 
for the full class of E-pattern languages as particularly interesting as it con- 
trasts with the results presented in [14] and [15]. The first proves the class of 
E-pattern languages not to be learnable for binary alphabets since even its sub- 
class of terminal-free E-pattern languages (generated by patterns that consist of 
variables only) is not learnable for these alphabets. Contrary to this, the latter 
shows that the class of terminal-free E-pattern languages is inferrable if the cor- 
responding terminal alphabet contains more than two letters. Consequently, with 
the result of the present paper in mind, for E-pattern languages there obviously
is no general way to extend positive findings for the terminal-free subclass to
the full class. The method we use is similar to the argumentation in [14], i.e. we
give for both types of alphabets a respective example pattern with a certain 
property which can mislead any potential learning strategy. The foundations of 
this way of reasoning - that, as in [14], is solely made possible by an appropriate 
alphabet size and the nondeterminism of E-pattern languages - are explained 
in Section 2. Finally, in Section 4 one of our example patterns is shown to be 
applicable to the examinations on the equivalence problem by Ohlebusch and 
Ukkonen in [12], disproving the central conjecture given therein. 



2 Preliminaries 

In order to keep this paper largely self-contained we now introduce a num- 
ber of definitions and basic properties. For standard mathematical notions and 
recursion-theoretic terms not defined explicitly, we refer to [18]; for unexplained 
aspects of formal language theory, [19] may be consulted. 

$\mathbb{N}$ is the set of natural numbers, $\{0, 1, 2, \ldots\}$. For an arbitrary set $A$ of symbols,
$A^+$ denotes the set of all non-empty words over $A$ and $A^*$ the set of all
(empty and non-empty) words over $A$. Any set $L \subseteq A^*$ is a language over an
alphabet $A$. We designate the empty word as $\varepsilon$. For the word that results from
the $n$-fold concatenation of a letter $a$ or of a word $w$ we write $a^n$ or $w^n$,
respectively. The size of a set $A$ is denoted by $|A|$ and the length of a word $w$ by $|w|$;
$|w|_a$ is the frequency of a letter $a$ in a word $w$.

For any word $w$ that contains at least one occurrence of a letter $a$ we define
the following subwords: $[w/a]$ is the prefix of $w$ up to (but not including) the
leftmost occurrence of the letter $a$ and $[a\backslash w]$ is the suffix of $w$ beginning with
the first letter that is to the right of the leftmost occurrence of $a$ in $w$. Thus,
the specified subwords satisfy $w = [w/a]\; a\; [a\backslash w]$; e.g., for $w = \mathtt{bcaab}$, the
subwords read $[w/\mathtt{a}] = \mathtt{bc}$ and $[\mathtt{a}\backslash w] = \mathtt{ab}$.

We proceed with the pattern specific terminology. $\Sigma$ is a finite or infinite
alphabet of terminal symbols and $X = \{x_1, x_2, x_3, \ldots\}$ an infinite set of variables,
$\Sigma \cap X = \emptyset$. Henceforth, we use lower case letters in typewriter font, e.g. $\mathtt{a}, \mathtt{b}, \mathtt{c}$,
as terminal symbols exclusively; words of terminal symbols are named as $u$, $v$,
or $w$. For every $j \geq 1$, the variable $y_j$ is unspecified, i.e. there may exist indices
$k, k'$ such that $k \neq k'$, but $y_k = y_{k'}$. For unspecified terminal symbols we use
upper case letters in typewriter font, such as $\mathtt{A}$.
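
The subword notation $[w/a]$ and $[a\backslash w]$ introduced above is used heavily in the proofs, so a small executable illustration may help. The helper names `prefix_before` and `suffix_after` are hypothetical and not taken from the paper; words are assumed to be ordinary Python strings.

```python
def prefix_before(w: str, a: str) -> str:
    """[w/a]: the prefix of w up to, but not including, the leftmost a."""
    return w[:w.index(a)]  # raises ValueError if a does not occur in w


def suffix_after(w: str, a: str) -> str:
    """[a\\w]: the suffix of w starting right after the leftmost a."""
    return w[w.index(a) + 1:]


w = "bcaab"
print(prefix_before(w, "a"), suffix_after(w, "a"))
# The decomposition w = [w/a] a [a\w] holds by construction:
print(prefix_before(w, "a") + "a" + suffix_after(w, "a") == w)
```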

A pattern is a non-empty word over $\Sigma \cup X$, a terminal-free pattern is a
non-empty word over $X$; naming patterns we use lower case letters from the
beginning of the Greek alphabet. $\mathrm{var}(\alpha)$ denotes the set of all variables of a
pattern $\alpha$. We write $\mathrm{Pat}_\Sigma$ for the set $(\Sigma \cup X)^+$ and we use Pat instead of $\mathrm{Pat}_\Sigma$
if $\Sigma$ is understood. The pattern $\chi(\alpha)$ derives from any $\alpha \in \mathrm{Pat}$ removing all
terminal symbols; e.g., $\chi(x_1 x_1\, \mathtt{a}\, x_2\, \mathtt{b}) = x_1 x_1 x_2$.

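Following [5], we designate two patterns $\alpha, \beta$ as similar if and only if
$\alpha = \alpha_0\, u_1\, \alpha_1\, u_2 \ldots \alpha_{m-1}\, u_m\, \alpha_m$ and $\beta = \beta_0\, u_1\, \beta_1\, u_2 \ldots \beta_{m-1}\, u_m\, \beta_m$ with $m \in \mathbb{N}$,
$\alpha_i, \beta_i \in X^+$ for $1 \leq i < m$, $\alpha_0, \beta_0, \alpha_m, \beta_m \in X^*$ and $u_i \in \Sigma^+$ for $i \leq m$; in other
words, we call patterns similar if and only if their terminal substrings coincide.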

A substitution is a morphism $\sigma : (\Sigma \cup X)^* \to \Sigma^*$ such that $\sigma(a) = a$ for
every $a \in \Sigma$. An inverse substitution is a morphism $\bar{\sigma} : \Sigma^* \to X^*$. The
E-pattern language $L_\Sigma(\alpha)$ of a pattern $\alpha$ is defined as the set of all $w \in \Sigma^*$ such
that $\sigma(\alpha) = w$ for some substitution $\sigma$. For any word $w = \sigma(\alpha)$ we say that $\sigma$
generates $w$, and for any language $L = L_\Sigma(\alpha)$ we say that $\alpha$ generates $L$. If there
is no need to give emphasis to the concrete shape of $\Sigma$ we denote the E-pattern
language of a pattern $\alpha$ simply as $L(\alpha)$. We use $\mathrm{ePAT}_\Sigma$ (or ePAT for short) as
an abbreviation for the full class of E-pattern languages over an alphabet $\Sigma$.

Following [11], we designate a pattern $\alpha$ as succinct if and only if $|\alpha| \leq |\beta|$
for all patterns $\beta$ with $L(\beta) = L(\alpha)$. The pattern $\beta = x_1 x_2 x_1 x_2$, for instance,
generates the same language as the pattern $\alpha = x_1 x_1$, and therefore $\beta$ is not
succinct; $\alpha$ is succinct because there does not exist any shorter pattern than $\alpha$
that exactly describes its language.

According to the studies of Mateescu and Salomaa on the nondeterminism
of pattern languages (cf. [10]) we denote a word $w$ as ambiguous (in respect
of a pattern $\alpha$) if and only if there exist two substitutions $\sigma$ and $\sigma'$ such that
$\sigma(\alpha) = w = \sigma'(\alpha)$, but $\sigma(x_i) \neq \sigma'(x_i)$ for some $x_i \in \mathrm{var}(\alpha)$. The word $w = \mathtt{aaba}$,
for instance, is ambiguous in respect of the pattern $\alpha = x_1\, \mathtt{a}\, x_2$ since it can be
generated by several substitutions, such as $\sigma$ and $\sigma'$ with $\sigma(x_1) = \mathtt{a}$, $\sigma(x_2) = \mathtt{ba}$
and $\sigma'(x_1) = \varepsilon$, $\sigma'(x_2) = \mathtt{aba}$.
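
To make the notion of ambiguity concrete, the substitutions generating a word can simply be enumerated. The following sketch is an illustration under assumptions of our own: patterns are token lists, tokens starting with `x` are variables, and `substitutions` is a hypothetical generator that yields every substitution (restricted to the variables of the pattern) mapping the pattern onto the word.

```python
from typing import Dict, Iterator, List, Optional


def substitutions(pattern: List[str], word: str, pos: int = 0, idx: int = 0,
                  subst: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Yield every substitution (empty images allowed) that maps
    pattern[idx:] onto word[pos:], given the bindings already in subst."""
    subst = {} if subst is None else subst
    if idx == len(pattern):
        if pos == len(word):
            yield dict(subst)  # copy, since subst is mutated while backtracking
        return
    tok = pattern[idx]
    if not tok.startswith("x"):  # terminal symbol
        if word.startswith(tok, pos):
            yield from substitutions(pattern, word, pos + len(tok), idx + 1, subst)
        return
    if tok in subst:  # variable that is already bound
        img = subst[tok]
        if word.startswith(img, pos):
            yield from substitutions(pattern, word, pos + len(img), idx + 1, subst)
        return
    for end in range(pos, len(word) + 1):  # unbound variable: try all images
        subst[tok] = word[pos:end]
        yield from substitutions(pattern, word, end, idx + 1, subst)
    del subst[tok]


found = list(substitutions(["x1", "a", "x2"], "aaba"))
for s in found:
    print(s)
print("ambiguous:", len(found) > 1)
```

For $w = \mathtt{aaba}$ and $\alpha = x_1\, \mathtt{a}\, x_2$ the enumeration contains the two substitutions named in the text, plus a third one that erases $x_2$.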

We now proceed with some decidability problems on E-pattern languages:
Let ePAT* be any set of E-pattern languages. We say that the inclusion problem
for ePAT* is decidable if and only if there exists a computable function which,
given two arbitrary patterns $\alpha, \beta$ with $L(\alpha), L(\beta) \in$ ePAT*, decides whether
or not $L(\alpha) \subseteq L(\beta)$. Correspondingly, the equivalence problem is decidable if
and only if there exists another computable function which for every pair of
patterns $\alpha, \beta$ with $L(\alpha), L(\beta) \in$ ePAT* decides whether or not $L(\alpha) = L(\beta)$.
Obviously, the decidability of the inclusion implies the decidability of the
equivalence. The decidability of the equivalence problem for ePAT has not been resolved
yet (cf. Section 4), whereas the inclusion problem is known to be undecidable
(cf. [8]). Under certain circumstances, however, the inclusion problem is
decidable; this is a consequence of the following fact:

Fact 1 (Ohlebusch, Ukkonen [12]). Let $\Sigma$ be an alphabet and $\alpha, \beta$ two
arbitrary similar patterns such that $\Sigma$ contains two distinct letters not occurring in $\alpha$
and $\beta$. Then $L_\Sigma(\beta) \subseteq L_\Sigma(\alpha)$ iff there exists a morphism $\phi : \mathrm{var}(\alpha)^* \to \mathrm{var}(\beta)^*$
with $\phi(\alpha) = \beta$.

In particular, Fact 1 implies the decidability of the inclusion problem for the
class of terminal-free E-pattern languages if the alphabet contains at least two
distinct letters (shown in [8]).
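
For terminal-free patterns the criterion of Fact 1 reduces to a finite search for a morphism between variable strings. The sketch below is our own illustration and not an algorithm given in the paper: `find_morphism` is a hypothetical helper that looks for $\phi$ with $\phi(\alpha) = \beta$ by brute-force backtracking, where both patterns are lists of variable names.

```python
from typing import Dict, List, Optional


def find_morphism(alpha: List[str], beta: List[str]) -> Optional[Dict[str, List[str]]]:
    """Return phi (variable -> list of variables) with phi(alpha) == beta,
    or None if no such morphism exists. Both patterns are terminal-free."""
    phi: Dict[str, List[str]] = {}

    def go(i: int, pos: int) -> bool:
        if i == len(alpha):
            return pos == len(beta)
        v = alpha[i]
        if v in phi:  # image already fixed: it must reappear here
            img = phi[v]
            return beta[pos:pos + len(img)] == img and go(i + 1, pos + len(img))
        for end in range(pos, len(beta) + 1):  # try every image, incl. empty
            phi[v] = beta[pos:end]
            if go(i + 1, end):
                return True
        del phi[v]
        return False

    return phi if go(0, 0) else None


a = ["x1", "x1"]
b = ["x1", "x2", "x1", "x2"]
print(find_morphism(a, b))  # a morphism witnessing L(b) is included in L(a)
print(find_morphism(b, a))  # a morphism witnessing L(a) is included in L(b)
```

Since morphisms exist in both directions, Fact 1 (applied over any alphabet with at least two letters) yields $L(x_1 x_1) = L(x_1 x_2 x_1 x_2)$, which is the example used above to illustrate succinctness. The search is exponential and intended for small patterns only.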

This paper exclusively deals with language theoretical properties of E-pattern 
languages. Both motivation and interpretation of our examination, however, are 
based on learning theory, and therefore we consider it useful to provide an ade- 
quate background. To this end, we now introduce our notions on Gold’s learning 
model (cf. [6]) and begin with a specification of the objects to be learned. In this 
regard, we restrict ourselves to any indexable class of non-empty languages; a
class $\mathcal{L}$ of languages is indexable if and only if there exists an indexed family (of
non-empty recursive languages) $(L_i)_{i \in \mathbb{N}}$ such that $\mathcal{L} = \{L_i \mid i \in \mathbb{N}\}$ - this means
that the membership is uniformly decidable for $(L_i)_{i \in \mathbb{N}}$, i.e. there is a total and
computable function which, given any pair of an index $i \in \mathbb{N}$ and a word $w \in \Sigma^*$,
decides whether or not $w \in L_i$. Concerning the learner's input, we exclusively
consider inference from positive data given as text. A text for an arbitrary
language $L$ is any total function $t : \mathbb{N} \to \Sigma^*$ satisfying $\{t(n) \mid n \in \mathbb{N}\} = L$.
For any text $t$, any $n \in \mathbb{N}$ and a symbol $\diamond \notin \Sigma$, $t^n \in (\Sigma \cup \{\diamond\})^+$ is a
coding of the first $n + 1$ values of $t$, i.e. $t^n := t(0) \diamond t(1) \diamond t(2) \ldots \diamond t(n)$. Last,
the learner and the learning goal need to be explained: Let the learner (or: the
learning strategy) $S$ be any total computable function that, for a given text $t$,
successively reads $t^0$, $t^1$, $t^2$, etc. and returns a corresponding stream of natural
numbers $S(t^0)$, $S(t^1)$, $S(t^2)$, and so on. For a language $L_j$ and a text $t$ for $L_j$,
we say that $S$ identifies $L_j$ from $t$ if and only if there exist natural numbers $n_0$
and $j'$ such that, for every $n \geq n_0$, $S(t^n) = j'$ and, additionally, $L_{j'} = L_j$. An
indexed family $(L_i)_{i \in \mathbb{N}}$ is learnable (in the limit) - or: inferrable from positive
data, or: $(L_i)_{i \in \mathbb{N}} \in$ LIM-TEXT for short - if and only if there is a learning
strategy $S$ identifying each language in $(L_i)_{i \in \mathbb{N}}$ from any corresponding text. Finally,
we call an indexable class $\mathcal{L}$ of languages learnable (in the limit) or inferrable
from positive data if and only if there is a learnable indexed family $(L_i)_{i \in \mathbb{N}}$ with
$\mathcal{L} = \{L_i \mid i \in \mathbb{N}\}$. In this case we write $\mathcal{L} \in$ LIM-TEXT for short.

In fact, the specific learning model given above - that largely is based on [2] - 
is just a special case of Gold’s learning model, which frequently is considered for 
more general applications as well. For numerous different analyses the elements of 
our definition are modified or generalised, such as the objects to be learned (e.g., 
using arbitrary classes of languages instead of indexed families) , the learning goal 
(e.g., asking for a semantic instead of a syntactic convergence), or the output of 
the learner (choosing a general hypothesis space instead of the indexed family).
Concerning the latter point we state that for the case when the LIM-TEXT
model is applied to an indexed family, the choice of a general hypothesis space
instead of the indexed family itself does not yield any additional learning power.
For information on suchlike aspects, see [24].

Angluin has introduced some criteria on indexed families that reduce learn- 
ability to a particular language theoretical aspect (cf. [2]) and thereby facilitate 
our approach to learnability questions. For our purposes, the following is suffi- 
cient (combining Condition 2 and Corollary 1 of the referenced paper): 

Fact 2 (Angluin [2]). Let $(L_i)_{i \in \mathbb{N}}$ be an arbitrary indexed family of non-empty
recursive languages. If $(L_i)_{i \in \mathbb{N}} \in$ LIM-TEXT then for every $j \in \mathbb{N}$ there exists a
set $T_j$ such that

- $T_j \subseteq L_j$,

- $T_j$ is finite, and

- there does not exist a $j' \in \mathbb{N}$ with $T_j \subseteq L_{j'} \subset L_j$.

If there exists a set $T_j$ satisfying the conditions of Fact 2 then it is called a
telltale (for $L_j$) (in respect of $(L_i)_{i \in \mathbb{N}}$).

The importance of telltales - that, at first glance, do not show any connection 
to the learning model - is caused by the need of avoiding overgeneralisation 
in the inference process, i.e. the case that the strategy outputs an index of a 
language which is a proper superset of the language to be learned and therefore, 
as the input consists of positive data only, is unable to detect its mistake. Thus, 
every language Lj in a learnable indexed family necessarily contains a finite set 
of words which, in the context of the indexed family, may be interpreted as a 
signal distinguishing the language from all languages that are subsets of Lj. 
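
The three telltale conditions of Fact 2 can be tried out on a toy example. The sketch below is purely illustrative and rests on a strong simplifying assumption that does not hold for pattern languages: the family is finite and every language in it is a finite set of words, so that all subsets can be enumerated. The function name `find_telltale` is hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Optional, Set


def find_telltale(j: int, family: List[FrozenSet[str]]) -> Optional[Set[str]]:
    """Return a smallest finite T with T <= L_j such that no L_k in the
    family satisfies T <= L_k < L_j (proper subset), or None if none exists."""
    target = family[j]
    words = sorted(target)
    for size in range(len(words) + 1):
        for candidate in combinations(words, size):
            t = set(candidate)
            # T fails if some language contains T but is a proper subset of L_j
            if not any(t <= lang and lang < target for lang in family):
                return t
    return None


toy_family = [frozenset({"a"}), frozenset({"a", "b"}), frozenset({"a", "b", "c"})]
for index in range(len(toy_family)):
    print(index, find_telltale(index, toy_family))
```

In this toy family the word `c` alone separates the largest language from its proper subsets. Lemma 2 and Lemma 3 below show that for $L(\alpha_{\mathtt{abc}})$ and $L(\alpha_{\mathtt{abcd}})$ no finite set of words can play this role.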

With regard to E-pattern languages, Fact 2 is applicable because ePAT is
an indexable class of non-empty languages. This is evident as, first, a recursive
enumeration of all patterns can be constructed with little effort and, second,
the decidability of the membership problem for any pattern $\alpha \in \mathrm{Pat}$ and word
$w \in \Sigma^*$ is guaranteed since the search space for a successful substitution of $\alpha$ is
bounded by the length of $w$.

Thus, we can conclude this section with a naming for a particular type of 
patterns that has been introduced in [14] and that directly aims at the content 
of Fact 2: A pattern $\beta$ is a passe-partout (for a pattern $\alpha$ and a finite set $W$ of
words) if and only if $W \subseteq L(\beta)$ and $L(\beta) \subset L(\alpha)$. Consequently, if there exists
such a passe-partout $\beta$ then $W$ is not a telltale for $L(\alpha)$.

3 The Main Result 

When asking for the learnability of the class of E-pattern languages then, because 
of the different results on unary, binary and infinite terminal alphabets (cf. [11] 
and [14]), it evidently is necessary to specify the size of the alphabet. Keeping this 
in mind, there are some results on the learnability of subclasses that are worth
taking into consideration, namely [20] and [15]. The first shows that the class
of regular E-pattern languages is learnable; these are languages generated by
patterns $\alpha$ with $|\alpha|_{x_j} = 1$ for all $x_j \in \mathrm{var}(\alpha)$. Thus, roughly speaking, there is a
way to algorithmically detect the position and the shape of the terminal symbols 
in the pattern from positive data. On the other hand, the latter publication shows 
that the class of terminal-free E-pattern languages is learnable if and only if the 
terminal alphabet does not consist of exactly two letters, or, in other words, that 
it is possible to extract the dependencies of variables for appropriate alphabets. 
However, our main result states that these theorems are only valid in their own 
context (i.e. the respective subclasses) and, consequently, that the combination 
of both approaches is impossible: 

Theorem 1. Let $\Sigma$ be an alphabet, $|\Sigma| \in \{3, 4\}$. Then $\mathrm{ePAT}_\Sigma \notin$ LIM-TEXT.

The proof of this theorem is given in the subsequent section. 

Thus, with Theorem 1 and the results in [11] and [14], the learnability of 
the class of E-pattern languages is resolved for infinite alphabets and for finite 
alphabets with up to four letters. Concerning finite alphabets with five or more 
distinct letters we conjecture - as an indirect consequence of Section 3.1 - that 
the question of learnability for all of them can be answered in the same way: 

Conjecture 1. Let $\Sigma_1, \Sigma_2$ be arbitrary finite alphabets with at least five letters
each. Then $\mathrm{ePAT}_{\Sigma_1} \in$ LIM-TEXT iff $\mathrm{ePAT}_{\Sigma_2} \in$ LIM-TEXT.



3.1 Proof of the Main Result 

First, we give an elementary lemma on morphisms, that can be formulated in 
several equivalent ways; however, with regard to the needs of the subsequent 
reasoning on Lemma 2 and Lemma 3 (that provide the actual proof of Theo- 
rem 1), we restrict ourselves to a rather special statement on mappings between 
terminal-free patterns. Although the fact specified therein may be considered 
evident we additionally give an appropriate proof sketch in order to keep this 
paper self-contained. 

Lemma 1. Let $\alpha, \beta$ be terminal-free patterns and $\phi, \psi$ morphisms with $\phi(\alpha) = \beta$
and $\psi(\beta) = \alpha$. Then either $\psi(\phi(x_j)) = x_j$ for every $x_j \in \mathrm{var}(\alpha)$ or there exists
an $x_{j'} \in \mathrm{var}(\alpha)$ such that $|\psi(\phi(x_{j'}))| \geq 2$ and $x_{j'} \in \mathrm{var}(\psi(\phi(x_{j'})))$.

We call any $x_{j'}$ satisfying these two conditions an anchor variable (in respect of
$\phi$ and $\psi$).

Proof. Let $\alpha := y_1 y_2 y_3 \ldots y_m$; then $\beta = \phi(y_1) \phi(y_2) \phi(y_3) \ldots \phi(y_m)$. Let $y_{k_0}$ be the
leftmost variable such that $\psi(\phi(y_{k_0})) \neq y_{k_0}$. Now assume to the contrary there is
no anchor variable in $\alpha$. Then $\psi(\phi(y_{k_0}))$ necessarily equals $\varepsilon$ as otherwise $\psi(\beta) \neq \alpha$.
Hence, $|\psi(\phi(y_1))\, \psi(\phi(y_2))\, \psi(\phi(y_3)) \ldots \psi(\phi(y_{k_0}))| = k_0 - 1$, and obviously, as
there is no anchor variable in $\alpha$, $|\psi(\phi(y_1))\, \psi(\phi(y_2))\, \psi(\phi(y_3)) \ldots \psi(\phi(y_k))| \leq k - 1$
for every $k > k_0$. Consequently, $|\psi(\beta)| < |\alpha|$ and therefore $\psi(\beta) \neq \alpha$. This
contradiction proves the lemma. □

We now proceed with the patterns that are crucial for our proof of Theorem 1.
Contrary to the simply structured pattern used in [14] as an instrument for the 
negative result on binary alphabets, the examples given here unfortunately have 
to be rather sophisticated: 

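Definition 1. The patterns $\alpha_{\mathtt{abc}}$ and $\alpha_{\mathtt{abcd}}$ are given by

$\alpha_{\mathtt{abc}} := x_1\, \mathtt{a}\, x_2\, x_3^2\, x_4^2\, x_5^2\, x_6^2\, \mathtt{a}\, x_7\, \mathtt{a}\, x_2\, x_8^2\, x_4^2\, x_5^2\, x_6^2$,

$\alpha_{\mathtt{abcd}} := x_1\, \mathtt{a}\, x_2\, x_3^2\, x_4^2\, x_5^2\, x_6^2\, x_7^2\, x_8\, \mathtt{b}\, x_9\, \mathtt{a}\, x_2\, x_{10}^2\, x_4^2\, x_5^2\, x_6^2\, x_{11}^2\, x_8\, \mathtt{b}\, x_{12}$.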

$\alpha_{\mathtt{abc}}$ is used in Lemma 2 for the proof of Theorem 1 in case of alphabets with
exactly three letters and $\alpha_{\mathtt{abcd}}$ in Lemma 3 for those with four. In these lemmata
we show that $L(\alpha_{\mathtt{abc}})$ and $L(\alpha_{\mathtt{abcd}})$ for their particular alphabets do not have
any telltale in respect of ePAT.

First, due to the intricacy of these patterns, we consider it helpful for the
understanding of the proofs of the lemmata to briefly discuss the meaning of some
of their variables and terminal symbols in our reasoning; we focus on $\alpha_{\mathtt{abc}}$ since
$\alpha_{\mathtt{abcd}}$ is a natural extension thereof. Our argumentation on the lemmata utilises
the insight that, with regard to E-pattern languages, the ambiguity of a word
decides on the question of whether this word can be a useful part of a telltale.
For instance, concerning the pattern $\alpha_0 := x_4^2 x_5^2 x_6^2$, that makes up the core of our
example patterns, it is shown in [14] and [15] that any telltale of $L(\alpha_0)$ necessarily
has to contain particular words which consist of three distinct letters in order to
avoid a specific and unwanted kind of ambiguity. However, if for any substitution
$\sigma$ that is applied to $\alpha_1 := x_1\, \mathtt{a}\, x_2\, x_3^2\, \alpha_0$ - which is a prefix of $\alpha_{\mathtt{abc}}$ - $\sigma(\alpha_0)$ contains
all three letters of the alphabet and, thus, includes the letter $\mathtt{a}$ then $\sigma(\alpha_1)$
again is ambiguous and always may be generated by a second substitution $\sigma'$
with $\sigma'(\alpha_0) = \varepsilon$, $\sigma'(x_1) = \sigma(x_1\, \mathtt{a}\, x_2\, x_3^2)\, [\sigma(\alpha_0)/\mathtt{a}]$, $\sigma'(x_2) = [\mathtt{a}\backslash\sigma(\alpha_0)]$. With $\sigma'$,
in turn, we can give an inverse substitution leading to a tailor-made pattern
that assuredly can be part of a passe-partout. Thus, for $\alpha_1$ we can state the
desired gap between, on the one hand, the need of substituting $\alpha_0$ by three
different letters and, on the other hand, the ambiguity of all words that conform
to this requirement. However, due to the unique variable $x_2$ in $\alpha_1$, the language
generated by $\alpha_1$ evidently equals that of $\alpha_2 := x_1\, \mathtt{a}\, x_2$, turning the core substring
$\alpha_0$ to be redundant. Therefore, $\alpha_1$ has to occur at least twice in the pattern (with
an optional separating occurrence of the letter $\mathtt{a}$). Since in the pattern $\alpha_1\, \mathtt{a}\, \alpha_1$
still both occurrences of the substring $\alpha_0$ are redundant, the second occurrence
of $\alpha_1$ is transformed into $\alpha_1' := x_7\, \mathtt{a}\, x_2\, x_8^2\, \alpha_0$. Hence, $\alpha_{\mathtt{abc}} = \alpha_1\, \mathtt{a}\, \alpha_1'$.
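
The ambiguity of $\sigma(\alpha_1)$ described above can be verified on a concrete instance. The following sketch is an illustration only: the substitution `sigma` is an arbitrary choice of ours, `apply` is a hypothetical helper, and unlisted variables (here also $x_3$ under the second substitution) are assumed to be mapped to the empty word.

```python
from typing import Dict, List


def apply(subst: Dict[str, str], pattern: List[str]) -> str:
    """Apply a substitution; tokens starting with 'x' are variables,
    and variables missing from subst are erased (mapped to '')."""
    return "".join(subst.get(t, "") if t.startswith("x") else t for t in pattern)


alpha0 = ["x4", "x4", "x5", "x5", "x6", "x6"]   # core x4^2 x5^2 x6^2
head = ["x1", "a", "x2", "x3", "x3"]            # x1 a x2 x3^2
alpha1 = head + alpha0

# An arbitrary substitution whose image of alpha0 contains a, b and c.
sigma = {"x1": "b", "x2": "c", "x3": "c", "x4": "b", "x5": "a", "x6": "c"}
core = apply(sigma, alpha0)
cut = core.index("a")  # position of the leftmost a in sigma(alpha0)

# Second substitution: alpha0 (and x3) erased, the word re-split at that a.
sigma_prime = {
    "x1": apply(sigma, head) + core[:cut],  # sigma(x1 a x2 x3^2) [sigma(alpha0)/a]
    "x2": core[cut + 1:],                   # [a \ sigma(alpha0)]
}

image = apply(sigma, alpha1)
image_prime = apply(sigma_prime, alpha1)
print(image, image_prime, image == image_prime)
```

Both substitutions produce the same word, although the second one never uses the core variables $x_4, x_5, x_6$.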

With regard to $\alpha_{\mathtt{abcd}}$, the underlying principle is similar. As stated above,
three distinct letters are needed for an appropriate telltale substitution $\sigma$ of $\alpha_0$.
However, if $\mathtt{b}, \mathtt{c}, \mathtt{d}$ are chosen as these letters, the desired ambiguity of $\sigma(\alpha_1)$
cannot be guaranteed. Hence, $\alpha_1$ in $\alpha_{\mathtt{abcd}}$ is extended to $\hat{\alpha}_1 := \alpha_1\, x_7^2\, x_8\, \mathtt{b}\, x_9$,
such that every $\sigma(\hat{\alpha}_1)$ is ambiguous as soon as $\sigma(\alpha_0)$ contains the letters $\mathtt{a}$ or $\mathtt{b}$.
Furthermore, due to the reasons described above, a modification of $\hat{\alpha}_1$ serves as
suffix of $\alpha_{\mathtt{abcd}}$, namely $\hat{\alpha}_1' := x_9\, \mathtt{a}\, x_2\, x_{10}^2\, \alpha_0\, x_{11}^2\, x_8\, \mathtt{b}\, x_{12}$. Contrary to the structure
of $\alpha_{\mathtt{abc}}$, the prefix $\hat{\alpha}_1$ and the suffix $\hat{\alpha}_1'$ in this case are not separated by a terminal
symbol, but they are overlapping.

Now we specify and formalise the approach discussed above:

Lemma 2. Let $\Sigma := \{\mathtt{a}, \mathtt{b}, \mathtt{c}\}$. Then for $\alpha_{\mathtt{abc}}$ and for every finite $W \subseteq L_\Sigma(\alpha_{\mathtt{abc}})$
there exists a passe-partout $\beta \in \mathrm{Pat}$.

Proof. If $W$ is empty then the claim of Lemma 2 holds trivially. Hence, let
$W = \{w_1, w_2, w_3, \ldots, w_n\}$ be non-empty. Then, as $W \subseteq L_\Sigma(\alpha_{\mathtt{abc}})$, for every
$w_i \in W$ there exists a substitution $\sigma_i$ satisfying $\sigma_i(\alpha_{\mathtt{abc}}) = w_i$. Using these $\sigma_i$
the following procedure constructs a passe-partout $\beta \in \mathrm{Pat}$:

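Initially, we define

$\beta_0 := \gamma_{1,0}\, \mathtt{a}\, \gamma_{2,0}\, \gamma_{3,0}^2\, \gamma_{4,0}^2\, \gamma_{5,0}^2\, \gamma_{6,0}^2\, \mathtt{a}\, \gamma_{7,0}\, \mathtt{a}\, \gamma_{2,0}\, \gamma_{8,0}^2\, \gamma_{4,0}^2\, \gamma_{5,0}^2\, \gamma_{6,0}^2$
with $\gamma_{j,0} := \varepsilon$ for every $j$, $1 \leq j \leq 8$.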

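For every $w_i \in W$ we define an inverse substitution $\bar{\sigma}_i : \Sigma^* \to X^*$ by

$$\bar{\sigma}_i(\mathtt{A}) := \begin{cases} x_{3i-2}, & \mathtt{A} = \mathtt{a}, \\ x_{3i-1}, & \mathtt{A} = \mathtt{b}, \\ x_{3i}, & \mathtt{A} = \mathtt{c}. \end{cases}$$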
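
The inverse substitution reserves a private block of variables for every word $w_i$, so that variables introduced for different words never interfere. A minimal sketch of this bookkeeping follows; the function names are hypothetical, and the optional `alphabet` parameter is our own generalisation that also covers the four-letter case used in Lemma 3.

```python
from typing import Dict, List


def inverse_substitution(i: int, word: str, alphabet: str = "abc") -> List[str]:
    """Map every letter of word to the variable reserved for it in block i,
    e.g. for alphabet 'abc': a -> x_{3i-2}, b -> x_{3i-1}, c -> x_{3i}."""
    k = len(alphabet)
    return [f"x{k * (i - 1) + alphabet.index(letter) + 1}" for letter in word]


def forward_substitution(i: int, alphabet: str = "abc") -> Dict[str, str]:
    """The substitution that undoes block i; all other variables are
    meant to be mapped to the empty word."""
    k = len(alphabet)
    return {f"x{k * (i - 1) + pos + 1}": letter for pos, letter in enumerate(alphabet)}


def apply(subst: Dict[str, str], pattern: List[str]) -> str:
    """Apply a substitution to a terminal-free pattern, erasing unlisted variables."""
    return "".join(subst.get(v, "") for v in pattern)


block = inverse_substitution(2, "abca")
print(block)
print(apply(forward_substitution(2), block))  # recovers the word
print(repr(apply(forward_substitution(1), block)))  # foreign block is erased
```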



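For every $i = 1, 2, 3, \ldots, n$ we now consider the following cases:

Case 1: There is no $\mathtt{A} \in \Sigma$ with $|\sigma_i(x_6)|_{\mathtt{A}} = 1$ and $|\sigma_i(\chi(\alpha_{\mathtt{abc}}))|_{\mathtt{A}} = 4$
  Define $\gamma_{j,i} := \gamma_{j,i-1}\, \bar{\sigma}_i(\sigma_i(x_j))$ for every $j$, $1 \leq j \leq 8$.

Case 2: There is an $\mathtt{A} \in \Sigma$ with $|\sigma_i(x_6)|_{\mathtt{A}} = 1$ and $|\sigma_i(\chi(\alpha_{\mathtt{abc}}))|_{\mathtt{A}} = 4$

  Case 2.1: $\mathtt{A} = \mathtt{a}$
    Define $\gamma_{1,i} := \gamma_{1,i-1}\, \bar{\sigma}_i(\sigma_i(x_1\, \mathtt{a}\, x_2\, x_3^2\, x_4^2\, x_5^2))\, \bar{\sigma}_i([\sigma_i(x_6^2)/\mathtt{a}])$,
           $\gamma_{2,i} := \gamma_{2,i-1}\, \bar{\sigma}_i([\mathtt{a}\backslash\sigma_i(x_6^2)])$,
           $\gamma_{7,i} := \gamma_{7,i-1}\, \bar{\sigma}_i(\sigma_i(x_7\, \mathtt{a}\, x_2\, x_8^2\, x_4^2\, x_5^2))\, \bar{\sigma}_i([\sigma_i(x_6^2)/\mathtt{a}])$,
           $\gamma_{j,i} := \gamma_{j,i-1}$, $j \in \{3, 4, 5, 6, 8\}$.

  Case 2.2: $\mathtt{A} = \mathtt{b}$

    Case 2.2.1: $\sigma_i(x_4^2 x_5^2) \in \{\mathtt{a}\}^* \cup \{\mathtt{c}\}^*$
      Define $\gamma_{4,i} := \gamma_{4,i-1}\, \bar{\sigma}_i(\sigma_i(x_4\, x_5))$,
             $\gamma_{5,i} := \gamma_{5,i-1}\, \bar{\sigma}_i(\sigma_i(x_6))$,
             $\gamma_{6,i} := \gamma_{6,i-1}$,
             $\gamma_{j,i} := \gamma_{j,i-1}\, \bar{\sigma}_i(\sigma_i(x_j))$, $j \in \{1, 2, 3, 7, 8\}$.

    Case 2.2.2: $\sigma_i(x_4^2 x_5^2) \in \{\mathtt{a}, \mathtt{c}\}^+ \setminus (\{\mathtt{a}\}^+ \cup \{\mathtt{c}\}^+)$
      Define $\gamma_{1,i} := \gamma_{1,i-1}\, \bar{\sigma}_i(\sigma_i(x_1\, \mathtt{a}\, x_2\, x_3^2))\, \bar{\sigma}_i([\sigma_i(x_4^2 x_5^2)/\mathtt{a}])$,
             $\gamma_{2,i} := \gamma_{2,i-1}\, \bar{\sigma}_i([\mathtt{a}\backslash\sigma_i(x_4^2 x_5^2 x_6^2)])$,
             $\gamma_{7,i} := \gamma_{7,i-1}\, \bar{\sigma}_i(\sigma_i(x_7\, \mathtt{a}\, x_2\, x_8^2))\, \bar{\sigma}_i([\sigma_i(x_4^2 x_5^2)/\mathtt{a}])$,
             $\gamma_{j,i} := \gamma_{j,i-1}$, $j \in \{3, 4, 5, 6, 8\}$.

  Case 2.3: $\mathtt{A} = \mathtt{c}$
    Adapt case 2.2 replacing $\mathtt{c}$ by $\mathtt{b}$ in the predicates of cases 2.2.1 and 2.2.2.

Finally, define

$\beta_i := \gamma_{1,i}\, \mathtt{a}\, \gamma_{2,i}\, \gamma_{3,i}^2\, \gamma_{4,i}^2\, \gamma_{5,i}^2\, \gamma_{6,i}^2\, \mathtt{a}\, \gamma_{7,i}\, \mathtt{a}\, \gamma_{2,i}\, \gamma_{8,i}^2\, \gamma_{4,i}^2\, \gamma_{5,i}^2\, \gamma_{6,i}^2$.

When this has been accomplished for every $i$, $1 \leq i \leq n$, then define $\beta := \beta_n$.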

Now, in order to conclude the proof, the following has to be shown: $\beta$ is a
passe-partout for $\alpha_{\mathtt{abc}}$ and $W$, i.e.

1. $W \subseteq L(\beta)$ and

2. $L(\beta) \subset L(\alpha_{\mathtt{abc}})$.

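ad 1. For every $i$, $1 \leq i \leq n$, we define a substitution $\sigma_i'$ by

$$\sigma_i'(x_j) := \begin{cases} \mathtt{a}, & j = 3i - 2, \\ \mathtt{b}, & j = 3i - 1, \\ \mathtt{c}, & j = 3i, \\ \varepsilon, & \text{else}. \end{cases}$$

If $w_i$ satisfies case 1 then obviously $\sigma_i'(\beta) = w_i$; if $w_i$ satisfies case 2 then $w_i$
necessarily is ambiguous and therefore in that case $\sigma_i'(\beta) = w_i$ as well. Thus,
$W \subseteq L(\beta)$.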

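ad 2. Obviously, $\alpha_{\mathtt{abc}}$ and $\beta$ are similar and there are two letters in $\Sigma$, namely $\mathtt{b}$
and $\mathtt{c}$, that do not occur in these patterns. Consequently, the inclusion criterion
given in Fact 1 is applicable. According to this, $L(\beta) \subseteq L(\alpha_{\mathtt{abc}})$ since there exists
a morphism $\phi : \mathrm{var}(\alpha_{\mathtt{abc}})^* \to \mathrm{var}(\beta)^*$ with $\phi(\alpha_{\mathtt{abc}}) = \beta$, given by $\phi(x_j) = \gamma_{j,n}$
for every $x_j \in \mathrm{var}(\alpha_{\mathtt{abc}})$.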

We now prove that $L(\beta)$ is a proper subset of $L(\alpha_{\mathtt{abc}})$. More precisely, we show
that there is no morphism $\psi : \mathrm{var}(\beta)^* \to \mathrm{var}(\alpha_{\mathtt{abc}})^*$ with $\psi(\beta) = \alpha_{\mathtt{abc}}$. For that
purpose, assume to the contrary there is such a morphism $\psi$. Then, as there is no
variable in $\mathrm{var}(\alpha_{\mathtt{abc}})$ with more than four occurrences in $\alpha_{\mathtt{abc}}$, $\psi(x_k) = \varepsilon$ for all
$x_k \in \mathrm{var}(\beta)$ with $|\beta|_{x_k} \geq 5$. With regard to the variables in $\mathrm{var}(\gamma_{6,n})$, this means
the following: If every letter in $\sigma_i(x_6)$ occurs more than four times in $\sigma_i(\chi(\alpha_{\mathtt{abc}}))$
then case 1 is satisfied and, consequently, every variable that is added to $\gamma_{6,i}$
occurs at least five times in $\beta$. If any letter $\mathtt{A}$ in $\sigma_i(x_6)$ occurs exactly four times in
$\sigma_i(\chi(\alpha_{\mathtt{abc}}))$ - and, obviously, it must be at least four times as $|\alpha_{\mathtt{abc}}|_{x_6} = 4$ - then
case 2 is applied, which, enabled by the ambiguity of $w_i$ in that case, arranges the
newly added components of $\gamma_{6,i}$ such that $\bar{\sigma}_i(\sigma_i(x_6))$ is shifted to a different $\gamma_{j,i}$.
Consequently, $|\beta|_{x_k} \geq 5$ for all $x_k \in \mathrm{var}(\gamma_{6,n})$ and, therefore, $\psi(\gamma_{6,n}) = \varepsilon \neq x_6$.
Hence, we analyse whether or not $\mathrm{var}(\alpha_{\mathtt{abc}})$ contains an anchor variable $x_{j'}$ in
respect of $\phi$ and $\psi$ (cf. Lemma 1). Evidently, $j' \notin \{1, 7\}$; for $j' \in \{3, 4, 5, 8\}$,
$x_{j'}$ being an anchor variable implies that $\psi(\gamma_{j',n}^2) = x_k x_{k'} \delta\, x_k x_{k'} \delta$ with variables
$x_k, x_{k'}$ and $\delta \in X^*$, but there is no substring in $\alpha_{\mathtt{abc}}$ that equals the given shape
of $\psi(\gamma_{j',n}^2)$. Finally, $x_2$ cannot be an anchor variable since $\psi(\gamma_{2,n})$ had to equal
both $x_2 x_3 \delta$ and $x_2 x_8 \delta$ for a $\delta \in X^*$. Consequently, there is no anchor variable
in $\mathrm{var}(\alpha_{\mathtt{abc}})$. This contradicts $\psi(\gamma_{6,n}) = \varepsilon \neq x_6$ and therefore the assumption is
incorrect. Thus, $L(\beta) \nsupseteq L(\alpha_{\mathtt{abc}})$ and, finally, $L(\beta) \subset L(\alpha_{\mathtt{abc}})$. □

Lemma 3. Let $\Sigma := \{\mathtt{a}, \mathtt{b}, \mathtt{c}, \mathtt{d}\}$. Then for $\alpha_{\mathtt{abcd}}$ and for every finite
$W \subseteq L_\Sigma(\alpha_{\mathtt{abcd}})$ there exists a passe-partout $\beta \in \mathrm{Pat}$.

Proof. We can argue similarly to the proof of Lemma 2: For an empty $W$ the claim
of Lemma 3 holds obviously. For any non-empty $W = \{w_1, w_2, w_3, \ldots, w_n\} \subseteq
L_\Sigma(\alpha_{\mathtt{abcd}})$ there exist substitutions $\sigma_i$, $1 \leq i \leq n$, satisfying $\sigma_i(\alpha_{\mathtt{abcd}}) = w_i$.
With these $\sigma_i$ we give the following procedure that constructs a passe-partout
$\beta \in \mathrm{Pat}$:

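Initially, we define

$\beta_0 := \gamma_{1,0}\, \mathtt{a}\, \gamma_{2,0}\, \gamma_{3,0}^2\, \gamma_{4,0}^2\, \gamma_{5,0}^2\, \gamma_{6,0}^2\, \gamma_{7,0}^2\, \gamma_{8,0}\, \mathtt{b}\, \gamma_{9,0}\, \mathtt{a}\, \gamma_{2,0}\, \gamma_{10,0}^2\, \gamma_{4,0}^2\, \gamma_{5,0}^2\, \gamma_{6,0}^2\, \gamma_{11,0}^2\, \gamma_{8,0}\, \mathtt{b}\, \gamma_{12,0}$

with $\gamma_{j,0} := \varepsilon$ for every $j$, $1 \leq j \leq 12$.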

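For every $w_i \in W$ we define an inverse substitution $\bar{\sigma}_i : \Sigma^* \to X^*$ by

$$\bar{\sigma}_i(\mathtt{A}) := \begin{cases} x_{4i-3}, & \mathtt{A} = \mathtt{a}, \\ x_{4i-2}, & \mathtt{A} = \mathtt{b}, \\ x_{4i-1}, & \mathtt{A} = \mathtt{c}, \\ x_{4i}, & \mathtt{A} = \mathtt{d}. \end{cases}$$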



For every i = 1, 2, 3, . . . , n we now consider the following cases: 

Case 1: There is no A G if with \ai{xe)\i^ = 1 and |cTj(x(aabcd))|A = 4 
Define 7^7 := 7j,i-i for every j, 1 < j < 12. 

Case 2: There is an A G if with \ai{xe)\i^ = 1 and |cTj(x(aabcd))|A = 4 
Case 2.1: A = a 

Define 717 := 717-1 ^■i{a^{xl a a;2 a:§ x|)) a]) , 

72, » := 72.J-1 dj([a\cr,(a:§)]) , 

79.* := 79,i-i a-i{a^{xg a X 2 xfg x^)) ^i{[<Xt{xl)/ a]) , 
lj,^ :=7j,*-i. j G {3,4,5,6,10}, 

7i,* := ^i{cFi{xj)), j G (7,8, 11, 12} . 

Case 2.2: A = b 

Define 7s,i := 78,*-i d(cTj(a;^ a;|)) ai{[(Ji{xl) m) , 

19.1 := 79,i-i CTj([b\(Ti(a;§ CC8 b xg)]) , 

7i2,i := 7i2,*-i di([b\(Ji(x| xj^ Xg b X12)]) , 

lj,t ■■= 7i.i-i, j G {4,5,6,7,11}, 

lj,i :=7i.i-i di{ai{xj)), j G {1,2,3,10}. 

Case 2.3: A = c 

Case 2.3.1: ai{x1 X5) G {a}* U {b}* U {d}* 

Define 74,* := 74,i-i di(crj(x4 X5)) , 

75,* ■ — 75,*— 1 0'i(fJi(xg)) , 

16.1 ^6,^—1 J 

7j,* ^i{cTi{xj)), j G {1,2,3,7,8,9,10,11,12}. 
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Case 2.3.2: <Ji(x 4 X 5 ) € {a, d}+ \ ({a}+ U {d}+) 

Define -fij := Jij-i ai(a^(xi a X2 x|)) x^)/ a]) , 

72, ^ '■= 72,i-i a-i([a\a^(xj x§ xl)]) , 

79,* :=79,i-i a-i{a^{xg a X 2 xIq)) d-i([cri(a;^ x|)/ a]) , 

73, ^ ■=7j,t-i, j G {3,4,5,6,10}, 

7i,* := 7i,i-i j G (7,8, 11, 12} . 



Case 2.3.3: ai{x“l Xg) G (a, b, d}+ \ (|a}+ U |b}+ U |d}+ U (a, d}+) 
Define 73 ,* := 78,i-i a-{[ai{xl x§)/ b]) , 

79, i ■= 79 ,i-i a^{[h\ai{xl xl xs b a;g)]) , 

7 i 2 .i := 712, *-i CTi([b\(Ji(x| xl xl xfi X8 b 0 : 12 )]) , 

7j,^ '■= 7j,i-i, j G {4,5,6,7,11}, 

7j,i '■=7j,i-i ^i((Ti(xj)), j G {1,2,3,10}. 



Case 2.4: A = d 

Adapt case 2.3 replacing d by c in the predicates of cases 2.3.1, 2.3.2 



2.3.3. 



and 



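Finally, define

$\beta_i := \gamma_{1,i}\, \mathtt{a}\, \gamma_{2,i}\, \gamma_{3,i}^2\, \gamma_{4,i}^2\, \gamma_{5,i}^2\, \gamma_{6,i}^2\, \gamma_{7,i}^2\, \gamma_{8,i}\, \mathtt{b}\, \gamma_{9,i}\, \mathtt{a}\, \gamma_{2,i}\, \gamma_{10,i}^2\, \gamma_{4,i}^2\, \gamma_{5,i}^2\, \gamma_{6,i}^2\, \gamma_{11,i}^2\, \gamma_{8,i}\, \mathtt{b}\, \gamma_{12,i}$.

When this has been accomplished for every $i$, $1 \leq i \leq n$, then define $\beta := \beta_n$.

For the proof that $\beta$ indeed is a passe-partout for $\alpha_{\mathtt{abcd}}$ and $W$, see the proof
of Lemma 2, mutatis mutandis. □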

Concluding the proof of Theorem 1, we state that it directly follows from
Lemma 2, Lemma 3, and Fact 2: Obviously, any indexed family $(L_i)_{i \in \mathbb{N}}$ with
$\{L_i \mid i \in \mathbb{N}\} = \mathrm{ePAT}$ necessarily contains all languages generated by potential
passe-partouts for $\alpha_{\mathtt{abc}}$ and $\alpha_{\mathtt{abcd}}$, respectively. Thus, $L_\Sigma(\alpha_{\mathtt{abc}})$ has no telltale in
respect of $\mathrm{ePAT}_\Sigma$ if $|\Sigma| = 3$ and $L_\Sigma(\alpha_{\mathtt{abcd}})$ has no telltale in respect of $\mathrm{ePAT}_\Sigma$ if
$|\Sigma| = 4$. Consequently, $\mathrm{ePAT}_\Sigma$ is not learnable for these two types of alphabets.



3.2 Some Remarks 

Clearly, both procedures constructing the passe-partouts implement only one
out of many possibilities. The definition of the $\gamma_{j,i}$ in case 2.3.1 in the proof of
Lemma 3, for instance, could be separated in cases 2.3.1.1 and 2.3.1.2 depending
on the question whether or not $\sigma_i(x_4^2 x_5^2) \in \{\mathtt{a}\}^+$. If so then case 2.3.1.1 could
equal case 2.3.2, possibly leading to a different passe-partout. It can be seen
easily that there are numerous other options like this. On the other hand, there
are infinitely many different succinct patterns that can act as a substitute for 
Oabc and Oabcd in the respective lemmata. Some of these patterns, for instance, 
can be constructed replacing in a^bc and «abcd the substring oq = x^xlxl by 
any Oq = XpXp_^_i . . . 2 :^+^, p > max{j | Xj G var(o!abcd)}, 9 > 4. Hence, the phe- 
nomenon described in Lemma 2 and Lemma 3 is ubiquitous in ePAT. Therefore
we give some brief considerations concerning the question on the shortest pat-
terns generating a language without telltale in respect of ePAT. Obviously, even
for the proof concept of Lemma 2 and Lemma 3, shorter patterns are suitable. In 
Q^abc) e.g., the substring and the separating terminal symbol a in the middle 
of the pattern can be removed without loss of applicability; for Oatcd, e.g., the 
substrings and x'^ can be mentioned. Nevertheless, we consider both patterns 
in the given shape easier to grasp, and, moreover, we assume that the indicated
steps for shortening $\alpha_{abc}$ and $\alpha_{abcd}$ lead to patterns with minimum length:

Conjecture 2. Let the alphabets $\Sigma_1$ and $\Sigma_2$ be given by
$\Sigma_1 := \{a, b, c\}$ and $\Sigma_2 := \{a, b, c, d\}$. Let the patterns
$\alpha_{abc}'$ and $\alpha_{abcd}'$ be given by

aabc^ := xi a X2 x^ X5 Xg X7 a X2 x\ Xg, 

«abcd' := Xi a X2 x\ x\ Xg Xg b a;g a Xg x^g x\ x\ Xg x\^ Xg b Xig. 

Then $L_{\Sigma_1}(\alpha_{abc}')$ has no telltale in respect of $ePAT_{\Sigma_1}$,
$L_{\Sigma_2}(\alpha_{abcd}')$ has no telltale in respect of $ePAT_{\Sigma_2}$, and
there do not exist any shorter patterns in Pat with this respective property.

Finally, we emphasise that we consider it necessary to prove our result for
both alphabet types separately. Obviously, for our way of reasoning, this is
caused by the fact that the proof of Lemma 2 cannot be conducted with
$\alpha_{abcd}$ since this pattern - in combination with any passe-partout an
adapted procedure could generate - does not satisfy the conditions of Fact 1 for
alphabets with three letters. In effect, the problem is even more fundamental:
Assume there are two alphabets $\Sigma_1$ and $\Sigma_2$ with
$\Sigma_1 \subset \Sigma_2$. If for some $\alpha \in Pat_{\Sigma_1}$ there is no
telltale $T_\alpha \subseteq L_{\Sigma_2}(\alpha)$ - as shown to be true for
$\alpha_{abcd}$ - then, at first glance, it seems natural to expect the same for
$L_{\Sigma_1}(\alpha)$ since $L_{\Sigma_1}(\alpha) \subset L_{\Sigma_2}(\alpha)$.
These considerations, however, are immediately disproven, for instance, by the
fact that ePAT is learnable for unary, but not for binary alphabets (cf. [11]
and [14]). This can be
illustrated easily, e.g., by Oabc and the pattern a = aaxi a. With Si = {a} and 
$\Sigma_2 = \{a, b\}$ we may state $L_{\Sigma_1}(\alpha) = L_{\Sigma_1}(\alpha_{abc})$,
but $L_{\Sigma_2}(\alpha) \subset L_{\Sigma_2}(\alpha_{abc})$. Thus,
for $\Sigma_1$ both patterns generate the same language and, consequently, they
have the same telltale, whereas any telltale for $L_{\Sigma_2}(\alpha_{abc})$ has
to contain a word that is not in $L_{\Sigma_2}(\alpha)$. The changing equivalence
of E-pattern languages is a well-known fact for pairs of alphabets if the smaller
one contains at most two distinct letters, but, concerning those pairs with three
or more letters each, [12] conjectures that the situation stabilises. This is
examined in the following section.
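
The dependence of E-pattern language equivalence on the alphabet can be checked
by brute force on small instances. The following Python sketch is an
illustration only: the two patterns `x1 a` and `x1 a x2` are hypothetical
stand-ins chosen for brevity (they are not the patterns of this paper), and the
names `matches` and `language_fragment` are assumptions. Tokens starting with
`x` are variables, every other token is a terminal, and variables may be
replaced by the empty word.

```python
from itertools import product

def matches(pattern, word, binding=None):
    """Check whether `word` results from `pattern` by an erasing substitution."""
    binding = {} if binding is None else binding
    if not pattern:
        return word == ""
    head, rest = pattern[0], pattern[1:]
    if not head.startswith("x"):           # terminal symbol
        return word.startswith(head) and matches(rest, word[len(head):], binding)
    if head in binding:                     # variable that is already bound
        value = binding[head]
        return word.startswith(value) and matches(rest, word[len(value):], binding)
    for cut in range(len(word) + 1):        # try every prefix, including the empty one
        if matches(rest, word[cut:], {**binding, head: word[:cut]}):
            return True
    return False

def language_fragment(pattern, alphabet, max_len):
    """All words over `alphabet` up to length `max_len` generated by `pattern`."""
    words = ("".join(t) for n in range(max_len + 1)
             for t in product(alphabet, repeat=n))
    return {w for w in words if matches(pattern, w)}

short, long_ = ["x1", "a"], ["x1", "a", "x2"]
same_over_unary = language_fragment(short, "a", 4) == language_fragment(long_, "a", 4)
proper_over_binary = language_fragment(short, "ab", 3) < language_fragment(long_, "ab", 3)
```

Under these assumptions both patterns agree on the unary alphabet, whereas over
the binary alphabet the word ab separates them.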

4 $\alpha_{abcd}$ and the Equivalence Problem

The equivalence problem for E-pattern languages - one of the most prominent
and well discussed open problems on this subject - was first examined in
[7] and later in [8], [5], and [12]. The latter authors conjecture that, for
patterns $\alpha, \beta \in Pat$ and any alphabet $\Sigma$, $|\Sigma| \ge 3$,
$L_\Sigma(\alpha) = L_\Sigma(\beta)$ if and only if there are morphisms
$\phi : var(\alpha) \to var(\beta)$ and $\psi : var(\beta) \to var(\alpha)$ such
that $\phi(\alpha) = \beta$ and $\psi(\beta) = \alpha$ (cf. [12], paraphrase of
Conjecture 1). Furthermore, derived from Fact 1 and Theorem 5.3 of [7], the
authors state that the equivalence problem is decidable if the following
question (cf. [12], Open Question 2) has a positive answer: For arbitrary
alphabets $\Sigma_1$, $\Sigma_2$ with $|\Sigma_1| \ge 3$ and
$\Sigma_2 = \Sigma_1 \cup \{d\}$, $d \notin \Sigma_1$, and patterns
$\alpha, \beta \in Pat_{\Sigma_1}$, does the following statement hold:
$L_{\Sigma_1}(\alpha) = L_{\Sigma_1}(\beta)$ iff
$L_{\Sigma_2}(\alpha) = L_{\Sigma_2}(\beta)$? In other words: Is the equivalence of
E-pattern languages preserved under alphabet extension?

We now show that for $|\Sigma_1| = 3$ this question has an answer in the negative,
using $\alpha_{abcd}$ - which for the learnability result in Section 3 is applied to $|\Sigma| = 4$ -
and the following pattern: := xi a X 2 Xy xgb xg a X 2 x^q xh xs b xi 2 - 

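Theorem 2. Let the alphabets $\Sigma_1$ and $\Sigma_2$ be given by
$\Sigma_1 := \{a, b, c\}$ and $\Sigma_2 \supseteq \{a, b, c, d\}$. Then
$L_{\Sigma_1}(\alpha_{abcd}) = L_{\Sigma_1}(\tilde{\alpha})$, but
$L_{\Sigma_2}(\alpha_{abcd}) \ne L_{\Sigma_2}(\tilde{\alpha})$.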

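Proof. We first show that $L_{\Sigma_1}(\alpha_{abcd}) = L_{\Sigma_1}(\tilde{\alpha})$.
Let $\sigma : (\Sigma_1 \cup X)^* \to \Sigma_1^*$ be any substitution that is
applied to $\tilde{\alpha}$. Then, obviously, the substitution $\sigma'$ with
$\sigma'(x_j) = \sigma(x_j)$ for all $x_j \in var(\tilde{\alpha})$ and
$\sigma'(x_j) = e$ for all $x_j \notin var(\tilde{\alpha})$ leads to
$\sigma'(\alpha_{abcd}) = \sigma(\tilde{\alpha})$ and, thus,
$L_{\Sigma_1}(\tilde{\alpha}) \subseteq L_{\Sigma_1}(\alpha_{abcd})$.

Now, let $\sigma$ be any substitution that is applied to $\alpha_{abcd}$. We give
a second substitution $\sigma'$ that leads to
$\sigma'(\tilde{\alpha}) = \sigma(\alpha_{abcd})$ and, thus,
$L_{\Sigma_1}(\tilde{\alpha}) = L_{\Sigma_1}(\alpha_{abcd})$: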

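Case 1: $\sigma(x_4^2 x_5^2 x_6^2) \in \{a, b, c\}^+ \setminus \{b, c\}^+$

Define $\sigma'(x_1) := \sigma(x_1\, a\, x_2\, x_3^2)\, [\sigma(x_4^2 x_5^2 x_6^2) / a]$,
$\sigma'(x_2) := [a \backslash \sigma(x_4^2 x_5^2 x_6^2)]$,
$\sigma'(x_9) := \sigma(x_9\, a\, x_2\, x_{10}^2)\, [\sigma(x_4^2 x_5^2 x_6^2) / a]$,
$\sigma'(x_j) := \sigma(x_j)$, $j \in \{7, 8, 11, 12\}$,
$\sigma'(x_j) := e$, $j \in \{3, 4, 10\}$.

Case 2: $\sigma(x_4^2 x_5^2 x_6^2) \in \{b, c\}^+ \setminus \{c\}^+$

Define $\sigma'$ symmetrically to case 1 using $x_9$ for $x_1$, $x_8$ for $x_2$,
and $x_{12}$ for $x_9$ (cf., e.g., case 2.2 in the proof of Lemma 3).

Case 3: $\sigma(x_4^2 x_5^2 x_6^2) \in \{c\}^*$

Define $\sigma'(x_4) = \sigma(x_4 x_5 x_6)$ and $\sigma'(x_j) = \sigma(x_j)$ for
$x_j \in var(\tilde{\alpha})$, $j \ne 4$.

The proof for $L_{\Sigma_2}(\tilde{\alpha}) \ne L_{\Sigma_2}(\alpha_{abcd})$ uses
Fact 1 and Lemma 1 and is similar to the argumentation on
$L(\beta) \subset L(\alpha_{abc})$ in the proof of Lemma 2. □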

Moreover, the reasoning on Theorem 2 reveals that Conjecture 1 in [12] - as
cited above - is incorrect:

Corollary 1. Let $\Sigma$ be an alphabet, $|\Sigma| = 3$. Then
$L_\Sigma(\alpha_{abcd}) = L_\Sigma(\tilde{\alpha})$ and there exists a morphism
$\phi : var(\alpha_{abcd}) \to var(\tilde{\alpha})$ with
$\phi(\alpha_{abcd}) = \tilde{\alpha}$, but there does not exist any morphism
$\psi : var(\tilde{\alpha}) \to var(\alpha_{abcd})$ with
$\psi(\tilde{\alpha}) = \alpha_{abcd}$.

Note that the argumentation on Theorem 2 and Corollary 1 can be conducted 
with a pattern that is shorter than Oabcd (e.g., by removing xq). 

In [16], which solely examines the above questions for the transition between
alphabets with four and alphabets with five letters, some methods of the present
section are adopted and, thus, explained in more detail.



Abstract. Different formal learning models address different aspects
of human learning. Below we compare Gold-style learning — interpreting 
learning as a limiting process in which the learner may change its mind 
arbitrarily often before converging to a correct hypothesis — to learning 
via queries — interpreting learning as a one-shot process in which the 
learner is required to identify the target concept with just one hypothesis. 
Although these two approaches seem rather unrelated at first glance, 
we provide characterizations of different models of Gold-style learning 
(learning in the limit, conservative inference, and behaviourally correct 
learning) in terms of query learning. Thus we describe the circumstances 
which are necessary to replace limit learners by equally powerful one- 
shot learners. Our results are valid in the general context of learning 
indexable classes of recursive languages. 

In order to achieve the learning capability of Gold-style learners, the 
crucial parameters of the query learning model are the type of queries 
(membership, restricted superset, or restricted disjointness queries) and 
the underlying hypothesis space (uniformly recursive, uniformly r.e., or
uniformly 2-r.e. families). The characterizations of Gold-style language
learning are formulated in terms of these parameters.



1 Introduction 

Undeniably, there is no formal scheme spanning all aspects of human learning. 
Thus each learning model analysed within the scope of learning theory addresses 
only special facets of our understanding of learning. 

For example, Gold’s [8] model of identification in the limit is concerned with 
learning as a limiting process of creating, modifying, and improving hypotheses 
about a target concept. These hypotheses are based upon instances of the target 
concept offered as information. In the limit, the learner is supposed to stabilize 
on a correct guess, but during the learning process one will never know whether 
or not the current hypothesis is already correct. Here the ability to change its 
mind is a crucial feature of the learner.

In contrast to that, Angluin’s [2,3] model of learning with queries focuses on
learning as a finite process of interaction between a learner and a teacher. The
learner asks questions of a specified type about the target concept and the 
teacher — having the target concept in mind — answers these questions truthfully. 
After finitely many steps of interaction the learner is supposed to return its sole 
hypothesis — correctly describing the target concept. Here the crucial features 
of the learner are its ability to demand special information on the target con- 
cept and its restrictiveness in terms of mind changes. Since a query learner is 
required to identify the target concept with just a single hypothesis, we refer to 
this phenomenon as one-shot learning. 

Our analysis concerns common features and coincidences between these two 
seemingly unrelated approaches, thereby focussing our attention on the identifi- 
cation of formal languages, ranging over indexable classes of recursive languages, 
as target concepts, see [1,10,14]. In case such coincidences exist, their revelation 
might allow for transferring theoretically approved insights from one model to 
the other. In this context, our main focus will be on characterizations of Gold- 
style language learning in terms of learning via queries. Characterizing different 
types of Gold-style language learning in such a way, we will point out interesting 
correspondences between the two models. In particular, our results demonstrate 
how learners identifying languages in the limit can be replaced by one-shot learn- 
ers without loss of learning power. That means, under certain circumstances the 
capability of limit learners is equal to that of one-shot learners using queries. 

The crucial question in this context is what abilities of the teacher are re- 
quired to achieve the learning capability of Gold-style learners for query learners. 
In particular, it is of importance which types of queries the teacher is able to 
answer (and thus the learner is allowed to ask). This addresses two facets: first, 
the kind of information prompted by the queries (we consider membership, re- 
stricted superset, and restricted disjointness queries) and second, the hypothesis 
space used by the learner to formulate its queries and hypotheses (we consider 
uniformly recursive, uniformly r. e., and uniformly 2-r.e. families). Note that 
both aspects affect the demands on the teacher. 

Our characterizations reveal the corresponding necessary requirements that 
have to be made on the teacher. Thereby we formulate coincidences of the learn- 
ing capabilities assigned to Gold-style learners and query learners in a quite 
general context, considering three variants of Gold-style language learning. More- 
over, we compare our results to several insights in Gold-style learning via oracles, 
see [13] for a formal background. As a byproduct of our analysis, we provide a
special indexable class of recursive languages which can be learned in a
behaviourally correct manner^1 in case a uniformly r.e. family is chosen as a
hypothesis space, but which is not learnable in the limit, no matter which
hypothesis space is chosen. Although such classes have already been offered in
the literature, see [1], up to now all examples, to the authors’ knowledge, are
defined via diagonalisation in a rather involved manner. In contrast to that,
the class we provide below is very simply and explicitly defined without any
diagonal construction.

^1 Behaviourally correct learning is a variant of learning in the limit, see for
example [7,4,13]. A definition is given later on.



2 Preliminaries and Basic Results 

2.1 Notations 

Familiarity with standard mathematical, recursion theoretic, and language
theoretic notions and notations is assumed, see [12,9]. From now on, a fixed
finite alphabet $\Sigma$ with $\{a, b\} \subseteq \Sigma$ is given. A word is any
element from $\Sigma^*$ and a language any subset of $\Sigma^*$. The complement
$\overline{L}$ of a language $L$ is the set $\Sigma^* \setminus L$. Any infinite
sequence $t = (w_i)_{i \in \mathbb{N}}$ with $\{w_i \mid i \in \mathbb{N}\} = L$
is called a text for $L$.

A family $(A_i)_{i \in \mathbb{N}}$ of languages is uniformly recursive
(uniformly r.e.) if there is a recursive (partial recursive) function $f$ such
that $A_i = \{w \in \Sigma^* \mid f(i, w) = 1\}$ for all $i \in \mathbb{N}$. A
family $(A_i)_{i \in \mathbb{N}}$ is uniformly 2-r.e., if there is a recursive
function $g$ such that
$A_i = \{w \in \Sigma^* \mid g(i, w, n) = 1 \text{ for all but finitely many } n\}$
for all $i \in \mathbb{N}$.
Note that for uniformly recursive families membership is uniformly decidable.
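
A uniformly recursive family and a text can be made concrete as follows. The
Python sketch below is only an illustration under stated assumptions: the family
$A_i = \{a^k \mid k \le i\}$ is chosen here as an example, and the names `f` and
`text_for` are hypothetical. The function `f` plays the role of the recursive
function in the definition, and `text_for(i)` enumerates an infinite sequence
whose set of elements is exactly $A_i$.

```python
from itertools import count, islice

def f(i, w):
    """Decide membership uniformly: f(i, w) = 1 iff w is in A_i = {a^k | k <= i}."""
    return 1 if set(w) <= {"a"} and len(w) <= i else 0

def text_for(i):
    """Yield an infinite text for A_i by cycling through a^0, a^1, ..., a^i."""
    for n in count():
        yield "a" * (n % (i + 1))

# A finite initial segment of a text for A_2.
segment = list(islice(text_for(2), 6))
```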

Let $\mathcal{C}$ be a class of recursive languages over $\Sigma^*$.
$\mathcal{C}$ is said to be an indexable class of recursive languages (in the
sequel we will write indexable class for short), if there is a uniformly
recursive family $(L_i)_{i \in \mathbb{N}}$ of all and only the languages in
$\mathcal{C}$. Such a family will subsequently be called an indexing of
$\mathcal{C}$.

A family $(T_i)_{i \in \mathbb{N}}$ of finite languages is recursively
generable, if there is a recursive function that, given $i \in \mathbb{N}$,
enumerates all elements of $T_i$ and stops.

In the sequel, let $\varphi$ be a Gödel numbering of all partial recursive
functions and $\Phi$ the associated Blum complexity measure, see [5].



2.2 Gold-Style Language Learning 

Let $\mathcal{C}$ be an indexable class, $\mathcal{H} = (L_i)_{i \in \mathbb{N}}$
any uniformly recursive family (called hypothesis space), and
$L \in \mathcal{C}$. An inductive inference machine (IIM) $M$ is an algorithmic
device that reads longer and longer initial segments $\sigma$ of a text and
outputs numbers $M(\sigma)$ as its hypotheses. An IIM $M$ returning some $i$ is
construed to hypothesize the language $L_i$. Given a text $t$ for $L$, $M$
identifies $L$ from $t$ with respect to $\mathcal{H}$ in the limit, if the
sequence of hypotheses output by $M$, when fed $t$, stabilizes on a number $i$
(i.e., past some point $M$ always outputs the hypothesis $i$) with $L_i = L$.
$M$ identifies $\mathcal{C}$ in the limit from text with respect to
$\mathcal{H}$, if it identifies every $L' \in \mathcal{C}$ from every
corresponding text. $LimTxt_{rec}$ denotes the collection of all indexable
classes $\mathcal{C}'$ for which there are an IIM $M'$ and a uniformly recursive
family $\mathcal{H}'$ such that $M'$ identifies $\mathcal{C}'$ in the limit from
text with respect to $\mathcal{H}'$. A quite natural and often studied
modification of $LimTxt_{rec}$ is defined by the model of conservative
inference, see [1]. $M$ is a conservative IIM for $\mathcal{C}$ with respect to
$\mathcal{H}$, if $M$ performs only justified mind changes, i.e., if $M$, on
some text $t$ for some $L \in \mathcal{C}$, outputs hypotheses $i$ and later
$j$, then $M$ must have seen some element $w \notin L_i$ before returning $j$.
The collection of all indexable classes identifiable from text by a conservative
IIM is denoted by $ConsvTxt_{rec}$. Note that
$ConsvTxt_{rec} \subset LimTxt_{rec}$ [14]. Since we consider learning from text
only, we will assume in the sequel that all languages to be learned are
non-empty.

One main aspect of human learning is modelled in the approach of learning in
the limit: the ability to change one’s mind during learning. Thus learning is
considered as a process in which the learner may change its hypothesis
arbitrarily often until reaching its final correct guess. In particular, it is
in general impossible to find out whether or not the final hypothesis has been
reached, i.e., whether or not a success in learning has already eventuated.
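
To illustrate identification in the limit, consider the following minimal Python
sketch. It rests on assumptions that are not part of the text above: the example
class consists of the languages $L_i = \{a^k \mid k \le i\}$, and the names
`hypothesis` and `run` are hypothetical. The learner guesses the largest
exponent seen so far, so it changes its mind only after seeing a word outside
its current guess, which also makes it conservative.

```python
def hypothesis(segment):
    """Guess the index i of L_i = {a^k | k <= i} from a finite text segment."""
    return max(len(w) for w in segment)

def run(text_prefix):
    """Return the learner's hypothesis after each initial segment of the prefix."""
    return [hypothesis(text_prefix[:n]) for n in range(1, len(text_prefix) + 1)]

# Beginning of a text for L_3: the hypotheses stabilize on 3 once "aaa" appears.
guesses = run(["a", "", "aaa", "aa", "aaa"])
```

From the finite prefix alone the learner cannot tell that its guess is final; a
longer word might still appear later in the text.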

Note that in the given context, where only uniformly recursive families are
considered as hypothesis spaces for indexable classes, $LimTxt_{rec}$ coincides
with the collection of all indexable classes identifiable from text in a
behaviourally correct manner, see [7]: If $\mathcal{C}$ is an indexable class,
$\mathcal{H} = (L_i)_{i \in \mathbb{N}}$ a uniformly recursive family, $M$ an
IIM, then $M$ is a behaviourally correct learner for $\mathcal{C}$ from text
with respect to $\mathcal{H}$, if for each $L \in \mathcal{C}$ and each text $t$
for $L$, all but finitely many outputs $i$ of $M$ when fed $t$ fulfil
$L_i = L$. Here $M$ may alternate different correct hypotheses arbitrarily often
instead of converging to a single hypothesis. Defining the notion $BcTxt_{rec}$
correspondingly as usual yields $BcTxt_{rec} = LimTxt_{rec}$ (a folklore
result). In particular, each IIM $BcTxt$-identifying an indexable class
$\mathcal{C}'$ in some uniformly recursive family $\mathcal{H}'$ can be modified
to an IIM $LimTxt$-identifying $\mathcal{C}'$ in $\mathcal{H}'$.

This coincidence no longer holds, if more general types of hypothesis spaces
are considered. Assume $\mathcal{C}$ is an indexable class and
$\mathcal{H}^+ = (U_i)_{i \in \mathbb{N}}$ is any uniformly r.e. family of
languages comprising $\mathcal{C}$. Then it is also conceivable to use
$\mathcal{H}^+$ as a hypothesis space. $LimTxt_{r.e.}$ ($BcTxt_{r.e.}$) denotes
the collection of all indexable classes learnable as in the definition of
$LimTxt_{rec}$ ($BcTxt_{rec}$), if the demand for a uniformly recursive family
$\mathcal{H}$ as a hypothesis space is loosened to demanding a uniformly r.e.
family $\mathcal{H}^+$ as a hypothesis space. Interestingly,
$LimTxt_{rec} = LimTxt_{r.e.}$ (a folklore result), i.e., in learning in the
limit, the capabilities of IIMs do not increase, if the constraints concerning
the hypothesis space are weakened by allowing for arbitrary uniformly r.e.
families. In contrast to that, in the context of $BcTxt$-identification,
weakening these constraints yields an add-on in learning power, i.e.,
$BcTxt_{rec} \subset BcTxt_{r.e.}$. In particular,
$LimTxt_{rec} \subset BcTxt_{r.e.}$ and so $LimTxt$- and $BcTxt$-learning no
longer coincide for identification with respect to arbitrary uniformly r.e.
families, see also [4,1].

Hence, in what follows, our analysis of Gold-style language learning will focus
on the inference types $LimTxt_{rec}$, $ConsvTxt_{rec}$, and $BcTxt_{r.e.}$.

The main results of our analysis will be characterizations of these inference 
types in the query learning model. For that purpose we will make use of well- 
known characterizations concerning so-called families of telltales, see [1]. 

Definition 1. Let $(L_i)_{i \in \mathbb{N}}$ be a uniformly recursive family and
$(T_i)_{i \in \mathbb{N}}$ a family of finite non-empty sets.
$(T_i)_{i \in \mathbb{N}}$ is called a family of telltales for
$(L_i)_{i \in \mathbb{N}}$ iff for all $i, j \in \mathbb{N}$:

1. $T_i \subseteq L_i$.

2. If $T_i \subseteq L_j \subseteq L_i$, then $L_j = L_i$.

The concept of telltale families is the best known notion to illustrate the
specific differences between indexable classes in $LimTxt_{rec}$,
$ConsvTxt_{rec}$, and $BcTxt_{r.e.}$. Telltale families and their algorithmic
structure have turned out to be characteristic for identifiability in our three
models, see [1,10,14,4]:

Theorem 1. Let $\mathcal{C}$ be an indexable class of languages.

1. $\mathcal{C} \in LimTxt_{rec}$ iff there is an indexing of $\mathcal{C}$
possessing a uniformly r.e. family of telltales.

2. $\mathcal{C} \in ConsvTxt_{rec}$ iff there is a uniformly recursive family
comprising $\mathcal{C}$ and possessing a recursively generable family of
telltales.

3. $\mathcal{C} \in BcTxt_{r.e.}$ iff there is an indexing of $\mathcal{C}$
possessing a family of telltales.
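
For finite families of finite languages, the two telltale conditions of
Definition 1 can be checked mechanically. The following Python sketch is an
illustration only; the function name `is_telltale_family` and the example
languages are assumptions made here, and the check says nothing about the
algorithmic structure (uniformly r.e., recursively generable) that Theorem 1
requires.

```python
def is_telltale_family(languages, telltales):
    """Check conditions 1 and 2 of the telltale definition on finite data."""
    for L_i, T_i in zip(languages, telltales):
        # Telltales are finite non-empty subsets of their language (condition 1).
        if not T_i or not T_i <= L_i:
            return False
        # No language of the family may lie strictly between T_i and L_i (condition 2).
        for L_j in languages:
            if T_i <= L_j and L_j < L_i:
                return False
    return True

chain = [{"a"}, {"a", "aa"}, {"a", "aa", "aaa"}]
good = is_telltale_family(chain, [{"a"}, {"aa"}, {"aaa"}])
bad = is_telltale_family(chain, [{"a"}, {"a"}, {"a"}])
```

In the second call the set {a} fails as a telltale for the larger languages,
because the smaller language {a} contains it and is a proper subset of them.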

The notion of telltales is closely related to the notion of locking sequences,
see [6]. If $\mathcal{H} = (U_i)_{i \in \mathbb{N}}$ is a hypothesis space, $M$
an IIM, and $L$ a language, then any finite text segment $\sigma$ of $L$ is
called a $LimTxt$-locking sequence for $M$ and $L$ (a $BcTxt$-locking sequence
for $M$, $L$ and $\mathcal{H}$), if $M(\sigma) = M(\sigma\sigma')$
($U_{M(\sigma)} = U_{M(\sigma\sigma')}$) for all finite text segments $\sigma'$
of $L$. If $L$ is $LimTxt$-learned by $M$ ($BcTxt$-learned by $M$) respecting
$\mathcal{H}$, then there exists a $LimTxt$-locking sequence $\sigma$ for $M$
and $L$ (a $BcTxt$-locking sequence for $M$, $L$, and $\mathcal{H}$). Moreover,
$U_{M(\sigma)} = L$ must be fulfilled for each such locking sequence.

2.3 Language Learning Via Queries 

In the query learning model, a learner has access to a teacher that truthfully
answers queries of a specified kind. A query learner $M$ is an algorithmic
device that, depending on the replies to the previous queries, either computes a
new query or returns a hypothesis and halts, see [2]. Its queries and hypotheses
are coded as natural numbers; both will be interpreted with respect to an
underlying hypothesis space. When learning an indexable class $\mathcal{C}$, any
indexing $\mathcal{H} = (L_i)_{i \in \mathbb{N}}$ of $\mathcal{C}$ may form a
hypothesis space. So, as in the original definition, see [2], when learning
$\mathcal{C}$, $M$ is only allowed to query languages belonging to
$\mathcal{C}$.

More formally, let $\mathcal{C}$ be an indexable class, let
$L \in \mathcal{C}$, let $\mathcal{H} = (L_i)_{i \in \mathbb{N}}$ be an indexing
of $\mathcal{C}$, and let $M$ be a query learner. $M$ learns $L$ with respect to
$\mathcal{H}$ using some type of queries if it eventually halts and its only
hypothesis, say $i$, correctly describes $L$, i.e., $L_i = L$. So $M$ returns
its unique and correct guess $i$ after only finitely many queries. Moreover, $M$
learns $\mathcal{C}$ with respect to $\mathcal{H}$ using some type of queries,
if it learns every $L' \in \mathcal{C}$ with respect to $\mathcal{H}$ using
queries of the specified type. Below we consider, for learning a target language
$L$:

Membership queries. The input is a string $w$ and the answer is ‘yes’ or ‘no’,
depending on whether or not $w$ belongs to $L$.

Restricted superset queries. The input is an index of a language
$L' \in \mathcal{C}$. The answer is ‘yes’ or ‘no’, depending on whether or not
$L'$ is a superset of $L$.

Restricted disjointness queries. The input is an index of a language
$L' \in \mathcal{C}$. The answer is ‘yes’ or ‘no’, depending on whether or not
$L'$ and $L$ are disjoint.^2

^2 The term “restricted” is used to distinguish these types of query learning
from learning with superset (disjointness) queries, where, together with each
negative answer the learner is provided a counterexample, i.e., a word in
$L \setminus L_j$ (in $L \cap L_j$).

$MemQ$, $rSupQ$, and $rDisQ$ denote the collections of all indexable classes
$\mathcal{C}'$ for which there are a query learner $M'$ and a hypothesis space
$\mathcal{H}'$ such that $M'$ learns $\mathcal{C}'$ with respect to
$\mathcal{H}'$ using membership, restricted superset, and restricted
disjointness queries, respectively. In the sequel we will omit the term
“restricted” for convenience. In the literature, see [2,3], more types of
queries such as (restricted) subset queries and equivalence queries have been
analysed, but in what follows we concentrate on the three types explained above.

Note that, in contrast to the Gold-style models introduced above, learning
via queries focuses on the aspect of one-shot learning, i.e., it is concerned
with learning scenarios in which learning may eventuate without mind changes.

Having a closer look at the different models of query learning, one easily finds
negative learnability results. For instance, the class $\mathcal{C}_{sup}$
consisting of the language $L^* = \{a\}^* \cup \{b\}$ and all languages
$\{a^k \mid k \le i\}$, $i \in \mathbb{N}$, is not learnable with superset
queries. Assume a query learner $M$ learns $\mathcal{C}_{sup}$ with superset
queries in an indexing $(L_i)_{i \in \mathbb{N}}$ of $\mathcal{C}_{sup}$ and
consider a scenario for $M$ learning $L^*$. Obviously, a query $j$ is answered
‘yes’, iff $L_j = L^*$. After finitely many queries, $M$ hypothesizes $L^*$. Now
let $i$ be maximal, such that a query $j$ with $L_j = \{a^k \mid k \le i\}$ has
been posed. The above scenario is also feasible for the language
$\{a^k \mid k \le i+1\}$. Given this language as a target, $M$ will return a
hypothesis representing $L^*$ and thus fail. This yields a contradiction, so
$\mathcal{C}_{sup} \notin rSupQ$.

Moreover, as can be verified easily, the class $\mathcal{C}_{dis}$ consisting
only of the languages $\{a\}$ and $\{a, b\}$ is not learnable with disjointness
queries.

Both examples point to a drawback of Angluin’s query model, namely the
demand that a query learner is restricted to pose queries concerning languages
contained in the class of possible target languages. Note that the class
$\mathcal{C}_{sup}$ would be learnable with superset queries, if it was
additionally permitted to query the language $\{a\}^*$, i.e., to ask whether or
not this language is a superset of the target language. Similarly,
$\mathcal{C}_{dis}$ would be learnable with disjointness queries, if it was
additionally permitted to query the language $\{b\}$. That means there are very
simple classes of languages, for which any query learner must fail just because
it is barred from asking the “appropriate” queries.
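
The effect of a single additional query can be seen on the class consisting of
{a} and {a, b}. The following Python sketch is a toy model under stated
assumptions: languages are represented as finite sets, and the names `C_DIS`,
`teacher`, and `learn_with_extra_query` are hypothetical. Disjointness queries
about the two class members are answered ‘no’ for both targets, whereas one
query about {b}, a language outside the class, separates them.

```python
C_DIS = [frozenset({"a"}), frozenset({"a", "b"})]

def teacher(target):
    """Return a function that truthfully answers disjointness queries about target."""
    return lambda query: target.isdisjoint(query)

def learn_with_extra_query(is_disjoint):
    """Identify the target with one query about {b}, which is not a member of C_DIS."""
    return C_DIS[0] if is_disjoint(frozenset({"b"})) else C_DIS[1]

# Queries restricted to the class give identical answers for both targets.
restricted_answers = [[teacher(t)(q) for q in C_DIS] for t in C_DIS]
```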

To overcome this drawback, it seems reasonable to allow the query learner to
formulate its queries with respect to any uniformly recursive family comprising
the target class $\mathcal{C}$. So let $\mathcal{C}$ be an indexable class. An
extra query learner for $\mathcal{C}$ is permitted to query languages in any
uniformly recursive family $(L'_i)_{i \in \mathbb{N}}$ comprising $\mathcal{C}$.
We say that $\mathcal{C}$ is learnable with extra superset (disjointness)
queries respecting $(L'_i)_{i \in \mathbb{N}}$ iff there is an extra query
learner $M$ learning $\mathcal{C}$ with respect to $(L'_i)_{i \in \mathbb{N}}$
using superset (disjointness) queries concerning $(L'_i)_{i \in \mathbb{N}}$.
Then $rSupQ_{rec}$ ($rDisQ_{rec}$) denotes the collection of all indexable
classes $\mathcal{C}$ learnable with extra superset (disjointness) queries
respecting a uniformly recursive family.

Our classes $\mathcal{C}_{sup}$ and $\mathcal{C}_{dis}$ witness
$rSupQ \subset rSupQ_{rec}$ and $rDisQ \subset rDisQ_{rec}$. Note that both
classes would already be learnable, if in addition to the superset
(disjointness) queries the learner was allowed to ask a membership query for the
word $b$. So the capabilities of $rSupQ$-learners ($rDisQ$-learners) already
increase with the additional permission to ask membership queries. Yet, as
Theorem 2 shows, combining superset or disjointness queries with membership
queries does not yield the same capability as extra queries do. For convenience,
denote the family of classes which are learnable with a combination of superset
(disjointness) queries and membership queries by $rSupMemQ$ ($rDisMemQ$).

Theorem 2. 1. $rSupQ \subset rSupMemQ \subset rSupQ_{rec}$.

2. $rDisQ \subset rDisMemQ \subset rDisQ_{rec}$.

Proof. ad 1. $rSupQ \subseteq rSupMemQ$ is evident; the class
$\mathcal{C}_{sup}$ yields the inequality.

In order to prove $rSupMemQ \subseteq rSupQ_{rec}$, note that, for any word $w$
and any language $L$, $w \in L$ iff $\Sigma^* \setminus \{w\} \not\supseteq L$.
This helps to simulate membership queries with extra superset queries. Further
details are omitted.
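
The simulation of a membership query by a superset query can be sketched in
Python as follows. This is a toy model under stated assumptions: a finite set
`universe` stands in for the set of all words, languages are finite sets, and
the names `superset_teacher` and `is_member` are hypothetical.

```python
def superset_teacher(target):
    """Return a function answering whether a queried language is a superset of target."""
    return lambda query: query >= target

def is_member(word, universe, is_superset):
    """Decide whether word is in the target using one superset query.

    The queried language is the universe without the word; it fails to be a
    superset of the target exactly when the word belongs to the target.
    """
    return not is_superset(universe - {word})

universe = frozenset({"", "a", "b", "ab"})
ask = superset_teacher(frozenset({"a", "ab"}))
answers = {w: is_member(w, universe, ask) for w in sorted(universe)}
```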

$rSupQ_{rec} \setminus rSupMemQ \ne \emptyset$ is witnessed by the class
$\mathcal{C}$ of all languages $L_k$ and $L_{k,l}$ for $k, l \in \mathbb{N}$,
where $L_k = \{a^k b^z \mid z \in \mathbb{N}\}$, $L_{k,l} = L_k$, if
$\varphi_k(k)$ is undefined, and
$L_{k,l} = \{a^k b^z \mid z \le \Phi_k(k) \lor z > \Phi_k(k) + l\}$, if
$\varphi_k(k)$ is defined, see [10].

To verify $\mathcal{C} \in rSupQ_{rec}$ choose a uniformly recursive family
comprising $\mathcal{C}$ and all languages
$L'_k = \{a^k b^z \mid z \le \Phi_k(k)\}$, $k \in \mathbb{N}$. Note that
$L'_k \in \mathcal{C}$ iff $\varphi_k(k)$ is undefined. An $rSupQ_{rec}$-learner
$M$ for $\mathcal{C}$ may act on the following instructions.

- For $k = 0, 1, 2, \ldots$ ask a superset query concerning $L_k$, until the
  answer ‘yes’ is received for the first time, i.e., until some $k$ with
  $L_k \supseteq L$ is found.

- Pose a superset query concerning the language $L'_k$. (* Note that $L'_k$ is a
  superset of the target language iff $L'_k$ is infinite iff $\varphi_k(k)$ is
  undefined. *)

  If the answer is ‘yes’, then output a hypothesis representing $L_k$ and stop.

  If the answer is ‘no’ (* in this case $\varphi_k(k)$ is defined *), then
  compute $\Phi_k(k)$. Pose a superset query concerning $L_{k,1}$. (* Note that,
  for any target language $L \subseteq L_k$, this query will be answered with
  ‘yes’ iff $a^k b^{\Phi_k(k)+1} \notin L$. *)

  If the answer is ‘no’, then output a hypothesis representing $L_k$ and stop.
  If the answer is ‘yes’, then, for any $l = 2, 3, 4, \ldots$, pose a superset
  query concerning $L_{k,l}$. As soon as such a query is answered with ‘no’, for
  some $l$, output a hypothesis representing $L_{k,l-1}$ and stop.

The details verifying that $M$ learns $\mathcal{C}$ with extra superset queries
are omitted.

In contrast to that one can show that $\mathcal{C} \notin rSupMemQ$. Otherwise
the halting problem with respect to $\varphi$ would be decidable. Details are
omitted.

Hence $rSupMemQ \subset rSupQ_{rec}$.

ad 2. $rDisQ \subseteq rDisMemQ$ is obvious; the class $\mathcal{C}_{dis}$
yields the inequality.

In order to prove $rDisMemQ \subseteq rDisQ_{rec}$, note that, for any word $w$
and any language $L$, $w \in L$ iff $\{w\}$ and $L$ are not disjoint. This helps
to simulate membership queries with extra disjointness queries. Further details
are omitted.

To prove the existence of a class in $rDisQ_{rec} \setminus rDisMemQ$, define an
indexable class $\mathcal{C}$ consisting of $L_0 = \{b\}$ and all languages
$L_{i+1} = \{a^{i+1}, b\}$, $i \in \mathbb{N}$.

To show that $\mathcal{C} \in rDisQ_{rec}$ choose a uniformly recursive family
comprising $\mathcal{C}$ as well as $\{a\}^*$ and all languages $\{a^{i+1}\}$,
$i \in \mathbb{N}$. A learner $M$ identifying $\mathcal{C}$ with extra
disjointness queries may work according to the following instructions.

Pose a disjointness query concerning $\{a\}^*$. (* Note that the only possible
target language disjoint with $\{a\}^*$ is $L_0$. *)

If the answer is ‘yes’, then return a hypothesis representing $L_0$ and stop.

If the answer is ‘no’, then, for $i = 0, 1, 2, \ldots$ ask a disjointness query
concerning $\{a^{i+1}\}$, until the answer ‘no’ is received for the first time.
(* Note that this must eventually happen. *) As soon as such a query is answered
with ‘no’, for some $i$, output a hypothesis representing $L_{i+1}$ and stop.

The details verifying that $M$ learns $\mathcal{C}$ with extra disjointness
queries are omitted.
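
The instructions for the extra disjointness learner can be sketched in Python.
The sketch makes several assumptions: query languages are passed as membership
predicates so that the infinite language of all words over a can be queried,
target languages are finite sets, and the names `language`, `make_teacher`, and
`learner` are hypothetical.

```python
def language(index):
    """Return L_0 = {b} or L_n = {a^n, b} for n >= 1 as a finite set."""
    return {"b"} if index == 0 else {"a" * index, "b"}

def make_teacher(target):
    """Return a function answering disjointness queries given as predicates."""
    def is_disjoint(query_language):
        return not any(query_language(w) for w in target)
    return is_disjoint

def learner(is_disjoint):
    """Return the index of the target; loops forever if the target is not in the class."""
    # First query: the language of all words consisting only of the letter a.
    if is_disjoint(lambda w: set(w) <= {"a"}):
        return 0
    # Then query {a^1}, {a^2}, ... until the first 'no'.
    i = 0
    while is_disjoint(lambda w, n=i + 1: w == "a" * n):
        i += 1
    return i + 1

results = [learner(make_teacher(language(n))) for n in range(6)]
```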

In contrast one can show that $\mathcal{C} \notin rDisMemQ$. For that purpose,
to deduce a contradiction, assume that there is a query learner $M$ identifying
$\mathcal{C}$ with disjointness and membership queries respecting an indexing
$(L'_i)_{i \in \mathbb{N}}$ of $\mathcal{C}$. Consider a learning scenario of
$M$ for the target language $L_0$. Obviously, each disjointness query is
answered with ‘no’, since every language in $\mathcal{C}$ contains $b$; a
membership query for a word $w$ is answered with ‘no’ iff $w \ne b$. After
finitely many queries, $M$ must return a hypothesis representing $L_0$. Now let
$i$ be maximal, such that a membership query concerning a word $a^i$ has been
posed. The scenario described above is also feasible for the language
$\{a^{i+1}, b\}$. If this language constitutes the target, then $M$ will return
a hypothesis representing $L_0$ and thus fail. This yields the desired
contradiction.

Hence $rDisMemQ \subset rDisQ_{rec}$. □



3 Characterizations of Gold-Style Inference Types 

3.1 Characterizations in the Query Model 

One main difference between Gold-style and query learning lies in the question 
whether or not a current hypothesis of a learner is already correct. A Gold- 
style learner is allowed to change its mind arbitrarily often (thus in general this 
question can not be answered), whereas a query learner has to find a correct 
representation of the target object already in the first guess, i.e., within “one 
shot” (and thus the question can always be answered in the affirmative). An- 
other difference is certainly the kind of information provided during the learning 
process. So, at first glance, these models seem to focus on very different aspects 
of human learning and do not seem to have much in common. 

Thus the question arises, whether there are any similarities in these models at 
all and whether there are aspects of learning both models capture. This requires 
a comparison of both models concerning the capabilities of the corresponding 
learners. In particular, one central question in this context is whether Gold-style 
(limit) learners can be replaced by equally powerful (one-shot) query learners. 
The rather trivial examples of classes not learnable with superset or disjointness 
queries already show that quite general hypothesis spaces — such as in learning 
with extra queries — are an important demand, if such a replacement shall be 
successful. In other words, we demand a more potent teacher, able to answer more 
general questions than in Angluin’s original model. Astonishingly, this demand is 
already sufficient to coincide with the capabilities of conservative limit learners: 
in [11] it is shown that the collection of indexable classes learnable with extra
superset queries coincides with $ConsvTxt_{rec}$. Moreover, this also holds for
the collection of indexable classes learnable with extra disjointness queries.

Theorem 3. $rSupQ_{rec} = rDisQ_{rec} = ConsvTxt_{rec}$.

Proof. $rSupQ_{rec} = ConsvTxt_{rec}$ holds by [11]. Thus it remains to prove
that $rSupQ_{rec} = rDisQ_{rec}$. For that purpose let $\mathcal{C}$ be any
indexable class.

First assume $\mathcal{C} \in rDisQ_{rec}$. Then there is a uniformly recursive
family $(L_i)_{i \in \mathbb{N}}$ and a query learner $M$, such that $M$ learns
$\mathcal{C}$ with extra disjointness queries with respect to
$(L_i)_{i \in \mathbb{N}}$. Now define $L'_{2i} := L_i$ and
$L'_{2i+1} := \overline{L_i}$ for all $i \in \mathbb{N}$.

Suppose $L$ is a target language. A query learner $M'$ identifying $L$ with
extra superset queries respecting $(L'_i)_{i \in \mathbb{N}}$ is defined via the
following instructions:

- Simulate $M$ when learning $L$.

- If $M$ poses a disjointness query concerning $L_i$, then pose a superset query
  concerning $L'_{2i+1}$ to your teacher. If the answer is ‘yes’, then transmit
  the answer ‘yes’ to $M$. If the answer is ‘no’, then transmit the answer ‘no’
  to $M$. (* Note that $L_i \cap L = \emptyset$ iff
  $L \subseteq \overline{L_i}$ iff $L'_{2i+1} \supseteq L$. *)

- If $M$ hypothesizes $L_i$, then output a representation for $L'_{2i}$.

It is not hard to verify that $M'$ learns $\mathcal{C}$ with extra superset
queries with respect to $(L'_i)_{i \in \mathbb{N}}$. Hence
$\mathcal{C} \in rSupQ_{rec}$. This implies $rDisQ_{rec} \subseteq rSupQ_{rec}$.

The opposite inclusion $rSupQ_{rec} \subseteq rDisQ_{rec}$ is verified
analogously. □

As initially in Gold-style learning, we have only considered uniformly recursive
families as hypothesis spaces for query learners. Similarly to the notion of
$BcTxt_{r.e.}$, it is conceivable to permit more general hypothesis spaces also in the
query model, i.e., to demand an even more potent teacher. Thus, by $rSupQ_{r.e.}$
($rDisQ_{r.e.}$) we denote the collection of all indexable classes which are learnable
with superset (disjointness) queries respecting a uniformly r.e. family.
Interestingly, this relaxation helps to characterize learning in the limit in terms of
query learning.

Theorem 4. $rDisQ_{r.e.} = LimTxt_{rec}$.

Proof. First we show $rDisQ_{r.e.} \subseteq LimTxt_{rec}$. For that purpose, let
$\mathcal{C} \in rDisQ_{r.e.}$ be an indexable class. Fix a uniformly r.e. family
$(U_i)_{i \in \mathbb{N}}$ and a query learner $M$ identifying $\mathcal{C}$ with disjointness
queries with respect to $(U_i)_{i \in \mathbb{N}}$.

The following IIM $M'$ $LimTxt$-identifies $\mathcal{C}$ with respect to $(U_i)_{i \in \mathbb{N}}$. Given a
text segment $\sigma$ of length $n$, $M'$ interacts with $M$ simulating a learning process
for $n$ steps. In step $k$, $k \le n$, depending on how $M'$ has replied to the previous
queries posed by $M$, the learner $M$ computes either (i) a new query $i$ or (ii) a
hypothesis $i$. In case (ii), $M'$ returns the hypothesis $i$ and stops simulating $M$.
In case (i), $M'$ checks whether there is a word in $\sigma$ which is found in $U_i$ within
$n$ steps. If such a word exists, $M'$ transmits the answer ‘no’ to $M$; else $M'$
transmits the answer ‘yes’ to $M$. If $k < n$, $M$ executes step $k+1$, else $M'$
returns any auxiliary hypothesis and stops simulating $M$. Given segments $\sigma$ of
a text for some target language, if their length $n$ is large enough, $M'$ answers
all queries of $M$ correctly and $M$ returns its sole hypothesis within $n$ steps. So,
the hypotheses returned by $M'$ stabilize on this correct guess.
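
The simulation of $M$ by $M'$ can be tried out on a toy scale. The following Python sketch is only an illustration under strong assumptions that are not part of the proof: the hypothesis space is a finite list of finite languages (so the test "found in $U_i$ within $n$ steps" becomes a plain membership test), and the query learner is modelled as a hypothetical generator `toy_learner` that yields `("ask", i)` or `("guess", i)` and receives the teacher's answer. The names `simulate_on_segment`, `toy_learner` and `family` are invented for this example.

```python
from typing import Callable, Generator, List, Sequence, Set, Tuple

# A learner action is ("ask", i) for a disjointness query about U_i,
# or ("guess", i) for the final hypothesis i.
Action = Tuple[str, int]
Learner = Generator[Action, bool, None]


def simulate_on_segment(make_learner: Callable[[], Learner],
                        family: Sequence[Set[str]],
                        segment: List[str],
                        auxiliary: int = 0) -> int:
    """Hypothesis of M' on one text segment (toy version of the proof)."""
    n = len(segment)
    learner = make_learner()          # fresh copy of the query learner M
    action = next(learner)            # first query or hypothesis of M
    for _ in range(n + 1):            # the step bound depends only on n
        kind, i = action
        if kind == "guess":
            return i                  # case (ii): pass on M's hypothesis
        # case (i): answer 'no' (False) iff some word of the segment is in U_i
        disjoint = not any(word in family[i] for word in segment)
        action = learner.send(disjoint)
    return auxiliary                  # M did not finish within the bound


def toy_learner() -> Learner:
    """Hypothetical query learner for the class {U_0, U_1, U_2} below."""
    disjoint_from_a = yield ("ask", 0)
    disjoint_from_b = yield ("ask", 1)
    if disjoint_from_a:
        yield ("guess", 1)
    elif disjoint_from_b:
        yield ("guess", 0)
    else:
        yield ("guess", 2)


family = [{"a"}, {"b"}, {"a", "b"}]
text = ["a", "a", "b", "a", "b"]      # a text for the target U_2 = {a, b}
guesses = [simulate_on_segment(toy_learner, family, text[:n])
           for n in range(len(text) + 1)]
print(guesses)
```

On short segments the simulated answers may be wrong (the word `b` has not appeared yet), so early guesses can be incorrect; once the segment is long enough, the guesses stay at index 2, which mirrors the stabilization argument above.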

Hence $\mathcal{C} \in LimTxt_{r.e.}$ ($= LimTxt_{rec}$) and therefore $rDisQ_{r.e.} \subseteq LimTxt_{rec}$.
Second we show that $LimTxt_{rec} \subseteq rDisQ_{r.e.}$. So let $\mathcal{C} \in LimTxt_{rec}$ be an
indexable class. Fix an indexing $\mathcal{L} = (L_i)_{i \in \mathbb{N}}$ of $\mathcal{C}$ and an IIM $M$, such that $M$
$LimTxt$-identifies $\mathcal{C}$ with respect to $\mathcal{L}$.

Let $(U_i)_{i \in \mathbb{N}}$ be any Gödel numbering of all r.e. languages and $(w_x)_{x \in \mathbb{N}}$ an
effective enumeration of $\Sigma^*$. Suppose $L \in \mathcal{C}$ is the target language. An $rDisQ$-learner
$M'$ for $L$ with respect to $(U_i)_{i \in \mathbb{N}}$ is defined to act on the following
instructions, starting in step 0. Note that Gödel numbers (representations in
$(U_i)_{i \in \mathbb{N}}$) can be computed for all queries to be asked. Step $n$ reads as follows:



- Ask disjointness queries for $\{w_0\}, \ldots, \{w_n\}$. Let $L[n]$ be the set of words $w_x$,
  $x \le n$, for which the corresponding query is answered with ‘no’. (* Note that
  $L[n] = L \cap \{w_x \mid x \le n\}$. *)

- Let $(\sigma_x^n)_{x \in \mathbb{N}}$ be an effective enumeration of all finite text segments for $L[n]$.
  For all $x, y \le n$ pose a disjointness query for $\overline{L_{M(\sigma_x^y)}}$ and thus build
  $Cand_n = \{\sigma_x^y \mid x, y \le n \text{ and } \overline{L_{M(\sigma_x^y)}} \cap L = \emptyset\}$
  from the queries answered with ‘yes’.
  (* Note that $Cand_n = \{\sigma_x^y \mid x, y \le n \text{ and } L \subseteq L_{M(\sigma_x^y)}\}$. *)

- For all $\sigma \in Cand_n$, pose a disjointness query for the language

  $U'_\sigma = \Sigma^*$, if $M(\sigma\sigma') \ne M(\sigma)$ for some text segment $\sigma'$ of $L_{M(\sigma)}$,
  $U'_\sigma = \emptyset$, otherwise.

  (* Note that $U'_\sigma$ is uniformly r.e. in $\sigma$ and $U'_\sigma \cap L = \emptyset$ iff $\sigma$ is a
  $LimTxt$-locking sequence for $M$ and $L_{M(\sigma)}$. *)

  If all these disjointness queries are answered with ‘no’, then go to step $n+1$.
  Otherwise, if $\sigma \in Cand_n$ is minimal fulfilling $U'_\sigma \cap L = \emptyset$, then return a
  hypothesis representing $L_{M(\sigma)}$ and stop.

$M'$ identifies $L$ with disjointness queries respecting $(U_i)_{i \in \mathbb{N}}$, because (i) $M'$
eventually returns a hypothesis and (ii) this hypothesis is correct for $L$. To prove (i),
note that $M$ is a $LimTxt$-learner for $L$ respecting $(L_i)_{i \in \mathbb{N}}$. So there are $i, x, y$
such that $M(\sigma_x^y) = i$, $L_i = L$, and $\sigma_x^y$ is a $LimTxt$-locking sequence for $M$ and $L$.
Then $U'_{\sigma_x^y} = \emptyset$ and the corresponding disjointness query is answered with ‘yes’.
Thus $M'$ returns a hypothesis. To prove (ii), assume $M'$ returns a hypothesis
representing $L_{M(\sigma)}$ for some text segment $\sigma$ of $L$. Then, by definition of $M'$,
$L \subseteq L_{M(\sigma)}$ and $\sigma$ is a $LimTxt$-locking sequence for $M$ and $L_{M(\sigma)}$. In particular,
$\sigma$ is a $LimTxt$-locking sequence for $M$ and $L$. Since $M$ learns $L$ in the limit from
text, this implies $L = L_{M(\sigma)}$. Hence the hypothesis $M'$ returns is correct for $L$.

Therefore $\mathcal{C} \in rDisQ_{r.e.}$ and $LimTxt_{rec} \subseteq rDisQ_{r.e.}$. □



Reducing the constraints concerning the hypothesis spaces even more, let
$rSupQ_{2\text{-}r.e.}$ ($rDisQ_{2\text{-}r.e.}$) denote the collection of all indexable classes which are
learnable using superset (disjointness) queries with respect to a uniformly 2-r.e.
family.$^3$ This finally allows for a characterization of the classes in $BcTxt_{r.e.}$.



Theorem 5. $rSupQ_{2\text{-}r.e.} = rDisQ_{2\text{-}r.e.} = BcTxt_{r.e.}$.



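Proof. First we show $rSupQ_{2\text{-}r.e.} \subseteq BcTxt_{r.e.}$ and $rDisQ_{2\text{-}r.e.} \subseteq BcTxt_{r.e.}$. For
that purpose, let $\mathcal{C} \in rSupQ_{2\text{-}r.e.}$ ($\mathcal{C} \in rDisQ_{2\text{-}r.e.}$) be an indexable class, $(L_i)_{i \in \mathbb{N}}$
an indexing of $\mathcal{C}$. Fix a uniformly 2-r.e. family $(V_i)_{i \in \mathbb{N}}$ and a query learner $M$
identifying $\mathcal{C}$ with superset (disjointness) queries respecting $(V_i)_{i \in \mathbb{N}}$.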

$^3$ With analogous definitions for Gold-style learning one easily obtains
$LimTxt_{2\text{-}r.e.} = LimTxt_{r.e.} = LimTxt_{rec}$ and $BcTxt_{2\text{-}r.e.} = BcTxt_{r.e.}$.

To obtain a contradiction, assume that $\mathcal{C} \notin BcTxt_{r.e.}$. By Theorem 1, $(L_i)_{i \in \mathbb{N}}$
does not possess a telltale family. In other words, there is some $i \in \mathbb{N}$, such that
for any finite set $W \subseteq L_i$ there exists some $j \in \mathbb{N}$ satisfying $W \subseteq L_j \subset L_i$. (*)

Consider $M$ when learning $L_i$. In the corresponding learning scenario $S$

- $M$ poses queries representing $V_{i_1^-}, \ldots, V_{i_k^-}, V_{i_1^+}, \ldots, V_{i_m^+}$ (in some order);

- the answers are ‘no’ for $V_{i_1^-}, \ldots, V_{i_k^-}$ and ‘yes’ for $V_{i_1^+}, \ldots, V_{i_m^+}$;

- afterwards $M$ returns a hypothesis representing $L_i$.

That means, for all $z \in \{1, \ldots, k\}$, we have $V_{i_z^-} \not\supseteq L_i$ ($V_{i_z^-} \cap L_i \ne \emptyset$). In
particular, for all $z \in \{1, \ldots, k\}$, there is a word $w_z \in L_i \setminus V_{i_z^-}$ ($w_z \in V_{i_z^-} \cap L_i$). Let
$W = \{w_1, \ldots, w_k\}$ ($\subseteq L_i$). By (*) there is some $j \in \mathbb{N}$ satisfying $W \subseteq L_j \subset L_i$.

Now note that the above scenario $S$ is also feasible for $L_j$: $w_z \in L_j$ implies
$V_{i_z^-} \not\supseteq L_j$ ($V_{i_z^-} \cap L_j \ne \emptyset$) for all $z \in \{1, \ldots, k\}$. $V_{i_z^+} \supseteq L_i$ ($V_{i_z^+} \cap L_i = \emptyset$) implies
$V_{i_z^+} \supseteq L_j$ ($V_{i_z^+} \cap L_j = \emptyset$) for all $z \in \{1, \ldots, m\}$. Thus all queries in $S$ are answered
truthfully for $L_j$. Since $M$ hypothesizes $L_i$ in the scenario $S$, and $L_i \ne L_j$, $M$
fails to identify $L_j$. This is the desired contradiction.

Hence $\mathcal{C} \in BcTxt_{r.e.}$, so $rSupQ_{2\text{-}r.e.} \subseteq BcTxt_{r.e.}$ and $rDisQ_{2\text{-}r.e.} \subseteq BcTxt_{r.e.}$.

Second we show that $BcTxt_{r.e.} \subseteq rSupQ_{2\text{-}r.e.}$ and $BcTxt_{r.e.} \subseteq rDisQ_{2\text{-}r.e.}$. So
let $\mathcal{C} \in BcTxt_{r.e.}$ be an indexable class. Fix a uniformly r.e. family $(U_i)_{i \in \mathbb{N}}$ and
an IIM $M$, such that $M$ $BcTxt_{r.e.}$-identifies $\mathcal{C}$ with respect to $(U_i)_{i \in \mathbb{N}}$.

Let $(V_i)_{i \in \mathbb{N}}$ be a uniformly 2-r.e. family such that indices can be computed
for all queries to be asked below. Let $(w_x)_{x \in \mathbb{N}}$ be an effective enumeration of $\Sigma^*$.

Assume $L \in \mathcal{C}$ is the target language. A query learner $M'$ identifying $L$
with superset (disjointness) queries respecting $(V_i)_{i \in \mathbb{N}}$ is defined according to
the following instructions, starting in step 0. Step $n$ reads as follows:



- Ask superset queries for $\Sigma^* \setminus \{w_i\}$ (disjointness queries for $\{w_i\}$) for all $i \le n$.
  Let $L[n]$ be the set of words $w_x$, $x \le n$, for which the corresponding query is
  answered with ‘no’. (* Note that $L[n] = L \cap \{w_x \mid x \le n\}$. *)

- Let $(\sigma_x^n)_{x \in \mathbb{N}}$ be an effective enumeration of all finite text segments for $L[n]$.
  For all $x, y \le n$ pose a superset query for $U_{M(\sigma_x^y)}$ (a disjointness query for
  $\overline{U_{M(\sigma_x^y)}}$) and thus build
  $Cand_n = \{\sigma_x^y \mid x, y \le n \text{ and } U_{M(\sigma_x^y)} \supseteq L\} = \{\sigma_x^y \mid x, y \le n \text{ and } \overline{U_{M(\sigma_x^y)}} \cap L = \emptyset\}$
  from the queries answered with ‘yes’.

- For all $\sigma \in Cand_n$, pose a superset (disjointness) query for the language

  $V'_\sigma = \Sigma^*$, if $U_{M(\sigma)} \ne U_{M(\sigma\sigma')}$ for some text segment $\sigma'$ of $U_{M(\sigma)}$,
  $V'_\sigma = \emptyset$, otherwise.

  (* Note that $V'_\sigma$ is uniformly 2-r.e. in $\sigma$ and $V'_\sigma \not\supseteq L$ iff $V'_\sigma \cap L = \emptyset$ iff $\sigma$ is a
  $BcTxt$-locking sequence for $M$ and $U_{M(\sigma)}$. *)

  If all these superset queries are answered with ‘yes’ (all these disjointness
  queries are answered with ‘no’), then go to step $n+1$. Otherwise, if $\sigma \in Cand_n$
  is minimal fulfilling $V'_\sigma \not\supseteq L$ and thus $V'_\sigma \cap L = \emptyset$, then return a hypothesis
  representing $U_{M(\sigma)}$ and stop.

$M'$ learns $L$ with superset (disjointness) queries in $(V_i)_{i \in \mathbb{N}}$, because (i) $M'$
eventually returns a hypothesis and (ii) this hypothesis is correct for $L$. To prove (i),
note that $M$ is a $BcTxt$-learner for $L$ in $(U_i)_{i \in \mathbb{N}}$. So there are $x, y$ such that
$U_{M(\sigma_x^y)} = L$ and $\sigma_x^y$ is a $BcTxt$-locking sequence for $M$, $L$, and $(U_i)_{i \in \mathbb{N}}$. Then
$V'_{\sigma_x^y} = \emptyset$ and the corresponding superset query is answered with ‘no’ (the
disjointness query with ‘yes’). Thus $M'$ returns a hypothesis. To prove (ii), suppose
$M'$ returns a hypothesis representing $U_{M(\sigma)}$ for a text segment $\sigma$ of $L$. Then, by
definition of $M'$, $\sigma$ is a $BcTxt$-locking sequence for $M$, $U_{M(\sigma)}$, and $(U_i)_{i \in \mathbb{N}}$. In
particular, $\sigma$ is a $BcTxt$-locking sequence for $M$, $L$, and $(U_i)_{i \in \mathbb{N}}$. As $M$
$BcTxt$-learns $L$, this implies $L = U_{M(\sigma)}$ and the hypothesis of $M'$ is correct for $L$.

Therefore $\mathcal{C} \in rSupQ_{2\text{-}r.e.} \cap rDisQ_{2\text{-}r.e.}$, and thus $BcTxt_{r.e.} \subseteq rSupQ_{2\text{-}r.e.}$ and
$BcTxt_{r.e.} \subseteq rDisQ_{2\text{-}r.e.}$. □



3.2 Characterizations in the Model of Learning with Oracles: A Comparison

In our characterizations we have seen that the capability of query learners 
strongly depends on the hypothesis space and thus on the demands concern- 
ing the abilities of the teacher. Of course a teacher might have to be more 
potent to answer questions with respect to some uniformly r. e. family than 
to work in some uniformly recursive family. For instance, teachers of the first 
kind might have to be able to solve the halting problem with respect to some 
Godel numbering. In other words, the learner might use such a teacher as an 
oracle for the halting problem. The problem we consider in the following is to 
specify nonrecursive sets $A \subseteq \mathbb{N}$ such that $A$-recursive$^4$ query learners using
uniformly recursive families as hypothesis spaces are as powerful as recursive
learners using uniformly r.e. or uniformly 2-r.e. families. For instance, we know
that $rDisQ_{rec} \subset rDisQ_{r.e.} = LimTxt_{rec}$. So we would like to specify a set $A$, such
that $LimTxt_{rec}$ equals the collection of all indexable classes which can be
identified with $A$-recursive $rDisQ_{rec}$-learners. The latter collection will be denoted
by $rDisQ_{rec}[A]$. Subsequently, similar notions are used correspondingly.

In the Gold-style model, the use of oracles has been analysed for example
in [13]. Most of the claims below use $K$-recursive or $Tot$-recursive learners, where
$K = \{i \mid \varphi_i(i) \text{ is defined}\}$ and $Tot = \{i \mid \varphi_i \text{ is a total function}\}$. Concerning
coincidences in Gold-style learning, the use of oracles is illustrated by Lemma 1.

Lemma 1. 1. [13] $ConsvTxt_{rec}[K] = LimTxt_{rec}$.

2. $ConsvTxt_{rec}[Tot] = LimTxt_{rec}[K] = BcTxt_{r.e.}$.

3. $BcTxt_{r.e.}[A] = BcTxt_{r.e.}$ for all $A \subseteq \mathbb{N}$.

Proof. ad 3. Let $A \subseteq \mathbb{N}$. By definition $BcTxt_{r.e.} \subseteq BcTxt_{r.e.}[A]$. Thus it remains
to prove the opposite inclusion, namely $BcTxt_{r.e.}[A] \subseteq BcTxt_{r.e.}$. For that
purpose let $\mathcal{C} \in BcTxt_{r.e.}[A]$ be an indexable class. Fix an $A$-recursive IIM $M$ such
that $\mathcal{C}$ is $BcTxt_{r.e.}$-learned by $M$. Moreover, let $(L_i)_{i \in \mathbb{N}}$ be an indexing of $\mathcal{C}$.

Striving for a contradiction, assume $\mathcal{C} \notin BcTxt_{r.e.}$. By Theorem 1, $(L_i)_{i \in \mathbb{N}}$
does not possess a telltale family. In other words, there is some $i \in \mathbb{N}$, such that
for any finite set $W \subseteq L_i$ there exists some $j \in \mathbb{N}$ satisfying $W \subseteq L_j \subset L_i$.

$^4$ $A$-recursive means recursive with the help of an oracle for the set $A$.

Since $M$ is a $BcTxt$-learner for $L_i$ in some hypothesis space $\mathcal{H}$, there must
be a $BcTxt$-locking sequence $\sigma$ for $M$, $L_i$, and $\mathcal{H}$. If $W$ denotes the set of words
occurring in $\sigma$, there is some language $L_j \in \mathcal{C}$ with $W \subseteq L_j \subset L_i$. Thus $\sigma$ is a
$BcTxt$-locking sequence for $M$, $L_j$, and $\mathcal{H}$. In particular, $M$ fails to
$BcTxt_{r.e.}$-identify $L_j$. This yields the contradiction. Hence $BcTxt_{r.e.}[A] = BcTxt_{r.e.}$.

ad 2. The proofs of $ConsvTxt_{rec}[Tot] \subseteq BcTxt_{r.e.}$ and $LimTxt_{rec}[K] \subseteq BcTxt_{r.e.}$
are obtained by similar means as the proof of 3. It suffices to use Theorem 1 for
$ConsvTxt_{rec}$ and $LimTxt_{rec}$ instead of the accordant statement for $BcTxt_{r.e.}$.
Note that $LimTxt_{rec}[K] = BcTxt_{r.e.}$ is already verified in [4].

Next we prove $BcTxt_{r.e.} \subseteq ConsvTxt_{rec}[Tot]$ and $BcTxt_{r.e.} \subseteq LimTxt_{rec}[K]$.
For that purpose, let $\mathcal{C}$ be an indexable class in $BcTxt_{r.e.}$. By Theorem 1 there
is an indexing $(L_i)_{i \in \mathbb{N}}$ of $\mathcal{C}$ which possesses a family of telltales. Next we show:

(i) $(L_i)_{i \in \mathbb{N}}$ possesses a $Tot$-recursively generable (uniformly $K$-r.e.) family
of telltales.

(ii) A $ConsvTxt_{rec}$-learner ($LimTxt_{rec}$-learner) for $\mathcal{C}$ can be computed from
any recursively generable (uniformly r.e.) family of telltales for $(L_i)_{i \in \mathbb{N}}$.

To prove (i). Let $(w_x)_{x \in \mathbb{N}}$ be an effective enumeration of all words in $\Sigma^*$.
Given $i \in \mathbb{N}$, let a function $f_i$ enumerate a set $T_i$ as follows.

- $f_i(0) = w_z$ for $z = \min\{x \mid w_x \in L_i\}$.

- If $f_i(0), \ldots, f_i(n)$ are computed, then test whether or not there is some $j \in \mathbb{N}$
  (some $j \le n$), such that $\{f_i(0), \ldots, f_i(n)\} \subseteq L_j \subset L_i$. (* Note that this test
  is $Tot$-recursive ($K$-recursive). *)

- If such a number $j$ exists, then $f_i(n+1) = w_z$ for $z = \min\{x \mid w_x \in
  L_i \setminus \{f_i(0), \ldots, f_i(n)\}\}$. If no such number $j$ exists, then $f_i(n+1) = f_i(n)$.
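
The enumeration of $T_i$ can be made concrete in a finite toy setting. The sketch below is an illustration only, under assumptions that go beyond the text: the family is a finite list of finite, nonempty Python sets, so the oracle-dependent test becomes an ordinary loop over the whole family (this corresponds to the variant that searches all $j$). The function name `telltale` and the sample data are invented for the example.

```python
from typing import List, Sequence, Set


def telltale(i: int, family: Sequence[Set[str]], words: List[str]) -> Set[str]:
    """Build T_i by the stepwise enumeration f_i(0), f_i(1), ... (toy setting).

    `words` plays the role of the effective enumeration (w_x) and must list
    every word of every language in `family`; L_i is assumed to be nonempty.
    """
    L_i = family[i]
    # f_i(0): the first enumerated word that belongs to L_i
    chosen = [next(w for w in words if w in L_i)]
    while True:
        current = set(chosen)
        # The test from the construction: is there L_j with
        # {f_i(0..n)} subset of L_j and L_j a proper subset of L_i?
        # Decidable here only because everything is finite.
        if not any(current <= L_j and L_j < L_i for L_j in family):
            return current            # f_i stays constant from now on
        # f_i(n+1): the next enumerated word of L_i not chosen so far
        chosen.append(next(w for w in words if w in L_i and w not in current))


words = ["a", "b", "c"]
chain = [{"a"}, {"a", "b"}, {"a", "b", "c"}]
print([telltale(i, chain, words) for i in range(len(chain))])
```

For the chain of languages above, every language needs its newest word in the telltale, since otherwise a proper sublanguage of the family would still contain all chosen words.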

With $T_i = \{f_i(x) \mid x \in \mathbb{N}\}$, it is not hard to verify that $(T_i)_{i \in \mathbb{N}}$ is a
$Tot$-recursively generable (uniformly $K$-r.e.) family of telltales for $(L_i)_{i \in \mathbb{N}}$. Here note that,
in the case of using a $Tot$-oracle, $T_i = \{f_i(x) \mid f_i(y+1) \ne f_i(y) \text{ for all } y < x\}$.
Finally, (ii) holds since Theorem 1.1/1.2 has a constructive proof, see [1,10].
Claims (i) and (ii) imply $\mathcal{C} \in ConsvTxt_{rec}[Tot]$ and $\mathcal{C} \in LimTxt_{rec}[K]$. So
$BcTxt_{r.e.} \subseteq ConsvTxt_{rec}[Tot]$ and $BcTxt_{r.e.} \subseteq LimTxt_{rec}[K]$. □

Since this proof is constructive, as are the proofs of our characterizations
above, we can deduce results like for example $rDisQ_{rec}[K] = LimTxt_{rec}$: Given
$\mathcal{C} \in LimTxt_{rec}$, a $K$-recursive conservative IIM for $\mathcal{C}$ can be constructed from a
$LimTxt_{rec}$-learner for $\mathcal{C}$. Moreover, an $rDisQ_{rec}$-learner for $\mathcal{C}$ can be constructed
from a conservative IIM for $\mathcal{C}$. Thus, a $K$-recursive $rDisQ_{rec}$-learner for $\mathcal{C}$ can be
constructed from a $LimTxt_{rec}$-learner. Similar results are obtained by combining
Lemma 1 with our characterizations above. This proves the following theorem.

Theorem 6. 1. $rSupQ_{rec}[K] = rDisQ_{rec}[K] = LimTxt_{rec}$.

2. $rSupQ_{rec}[Tot] = rDisQ_{rec}[Tot] = BcTxt_{r.e.}$.

3. $rSupQ_{2\text{-}r.e.}[A] = rDisQ_{2\text{-}r.e.}[A] = BcTxt_{r.e.}$ for all $A \subseteq \mathbb{N}$.

4 Discussion 

Our characterizations have revealed a correspondence between Gold-style learn- 
ing and learning via queries — between limiting and one-shot learning processes. 




Crucial in this context is that the learner may ask the “appropriate” queries. 
Thus the choice of hypothesis spaces and, correspondingly, the ability of the 
teacher is decisive. If the teacher is potent enough to answer disjointness queries 
in some uniformly r. e. family of languages, then, by Theorem 4, learning with 
disjointness queries coincides with learning in the limit. Interestingly, given uni- 
formly recursive or uniformly 2-r. e. families as hypothesis spaces, disjointness 
and superset queries coincide respecting the learning capabilities. As it turns 
out, this coincidence is not valid if the hypothesis space may be any uniformly
r.e. family. That means, $rDisQ_{r.e.}$ (and $LimTxt_{rec}$) is not equal to the collection
of all indexable classes learnable with superset queries in uniformly r.e. families.

Theorem 7. $LimTxt_{rec} \subset rSupQ_{r.e.}$.

Proof. To verify $LimTxt_{rec} \subseteq rSupQ_{r.e.}$, the proof of $LimTxt_{rec} \subseteq rDisQ_{r.e.}$ can
be adapted. It remains to quote a class in $rSupQ_{r.e.} \setminus LimTxt_{rec}$.

Let, for all $k, n \in \mathbb{N}$, $\mathcal{C}_{lim}$ contain the languages $L_k = \{a^k b^z \mid z \ge 0\}$ and

$L_{k,n} = \{a^k b^z \mid z \le m\}$, if $m \le n$ is minimal such that $\varphi_k(m)$ is undefined,
$L_{k,n} = \{a^k b^z \mid z \le n\} \cup [\ldots]$, if $\varphi_k(x)$ is defined for all $x \le n$
and $y = \max\{\Phi_k(x) \mid x \le n\}$.

$\mathcal{C}_{lim}$ is an indexable class; the proof is omitted due to space constraints.
To show $\mathcal{C}_{lim} \in rSupQ_{r.e.}$, let $(U_i)_{i \in \mathbb{N}}$ be a Gödel numbering of all r.e.
languages. Assume $L \in \mathcal{C}_{lim}$ is the target language. A learner $M$ identifying $L$ with
superset queries respecting $(U_i)_{i \in \mathbb{N}}$ is defined to act on the following instructions:

- For $k = 0, 1, 2, \ldots$ ask a superset query concerning $L_k \cup \{b^r a^s \mid r, s \in \mathbb{N}\}$,
  until the answer ‘yes’ is received for the first time.

- Pose a superset query concerning the language $L_k$.

  If the answer is ‘no’, then, for $r, s = 0, 1, 2, \ldots$ ask a superset query
  concerning $L_k \cup \{b^r a^s\}$ until the answer ‘yes’ is received for the first time.
  Output a hypothesis representing $L_{k,r}$ and stop.

  If the answer is ‘yes’, then pose a superset query for the language

  $U'_k = \{a^k b^z \mid z \le n\}$, if $n$ is minimal, such that $\varphi_k(n)$ is undefined,
  $U'_k = \{a^k b^z \mid z \ge 0\}$, if $\varphi_k$ is a total function.

  (* Note that $U'_k$ is uniformly r.e. in $k$. $U'_k$ is a superset of $L$ iff $U'_k = L$. *)
  If the answer is ‘yes’, then return a hypothesis representing $U'_k$ and stop.
  If the answer is ‘no’, then return a hypothesis representing $L_k$ and stop.

The details proving that $M$ $rSupQ$-identifies $\mathcal{C}_{lim}$ respecting $(U_i)_{i \in \mathbb{N}}$ are omitted.

Finally, $\mathcal{C}_{lim} \notin LimTxt_{rec}$ holds, since otherwise $Tot$ would be $K$-recursive.
To verify this, assume $M$ is an IIM learning $\mathcal{C}_{lim}$ in the limit from text. Let
$k \ge 0$. To decide whether or not $\varphi_k$ is a total function, proceed as follows:

Let $\sigma$ be a $LimTxt$-locking sequence for $M$ and $L_k$. (* Note that $\sigma$ exists
by assumption and thus can be found by a $K$-recursive procedure. *) If there
is some $x \le \max\{z \mid a^k b^z \text{ occurs in } \sigma\}$, such that $\varphi_k(x)$ is undefined (* also a
$K$-recursive test *), then return ‘0’. Otherwise return ‘1’.

It remains to show that $\varphi_k$ is total, if this procedure returns ‘1’. So let the
procedure return ‘1’. Assume $\varphi_k$ is not total and $n$ is minimal, such that $\varphi_k(n)$
is undefined. By definition, the language $L = \{a^k b^z \mid z \le n\}$ belongs to $\mathcal{C}_{lim}$.
Then the sequence $\sigma$ found in the procedure is also a text segment for $L$ and,
by choice (since $L \subset L_k$), a $LimTxt$-locking sequence for $M$ and $L$. As $M(\sigma)$ is
correct for $L_k$, $M$ fails to identify $L$. This is a contradiction; hence $\varphi_k$ is total.
Thus the set $Tot$ is $K$-recursive, a contradiction. So $\mathcal{C}_{lim} \notin LimTxt_{rec}$. □

Since $rSupQ_{r.e.} \subseteq rSupQ_{2\text{-}r.e.}$, one easily obtains $rSupQ_{r.e.} \subseteq BcTxt_{r.e.}$.
Whether or not these two collections are equal remains an open question. Still
it is possible to prove that any indexable class containing just infinite languages
is in $rSupQ_{r.e.}$ iff it is in $BcTxt_{r.e.}$. We omit the proof. In contrast to that there
are classes of only infinite languages in $BcTxt_{r.e.} \setminus LimTxt_{rec}$.

Moreover, note that the indexable class $\mathcal{C}_{lim}$ defined in the proof of Theorem 7
belongs to $BcTxt_{r.e.} \setminus LimTxt_{rec}$. Up to now, the literature has not offered many
such classes. The first example can be found in [1], but its definition is quite
involved and uses a diagonalisation. In contrast to that, $\mathcal{C}_{lim}$ is defined
compactly and explicitly without a diagonal construction and is, to the authors’
knowledge, the first such class known in $BcTxt_{r.e.} \setminus LimTxt_{rec}$.

Abstract. We show several PAC-style concentration bounds for learning a
unigrams language model. One interesting quantity is the probability
of all words appearing exactly $k$ times in a sample of size $m$. A standard
estimator for this quantity is the Good-Turing estimator. The existing
analysis on its error shows a PAC bound of approximately $O(k/\sqrt{m})$.
We improve its dependency on $k$ to $O(\sqrt[4]{k}/\sqrt{m} + k/m)$. We also analyze the
empirical frequencies estimator, showing that its PAC error bound is
approximately $O(1/k + \sqrt{k}/m)$. We derive a combined estimator, which has
an error of approximately $O(m^{-2/5})$, for any $k$.

A standard measure for the quality of a learning algorithm is its expected
per-word log-loss. We show that the leave-one-out method can be used
for estimating the log-loss of the unigrams model with a PAC error of
approximately $O(1/\sqrt{m})$ for any distribution.

We also bound the log-loss a priori, as a function of various parameters
of the distribution.



1 Introduction and Overview 

Natural language processing (NLP) has developed rapidly over the last decades. 
It has a wide range of applications, including speech recognition, optical charac- 
ter recognition, text categorization and many more. The theoretical analysis has 
also advanced significantly, though many fundamental questions remain unan- 
swered. One clear challenge, both practical and theoretical, concerns deriving 
stochastic models for natural languages. 

Consider a simple language model, where the distribution of each word in the 
text is assumed to be independent. Even for such a simplistic model, fundamental 
questions relating sample size to the learning accuracy are already challenging. 
This is mainly due to the fact that the sample size is almost always insufficient, 
regardless of how large it is. 

To demonstrate this phenomenon, consider the following example. We would
like to estimate the distribution of first names in the university. For that, we
are given the names list of a graduate seminar: Alice, Bob, Charlie, Dan, Eve,
Frank, two Georges, and two Henries. How can we use this sample to estimate the
distribution of students’ first names? An empirical frequency estimator would
assign Alice the probability of 0.1, since there is one Alice in the list of 10 names,
while George, appearing twice, would get an estimate of 0.2. Unfortunately,
unseen names, such as Michael, will get an estimate of 0. Clearly, in this simple
example the empirical frequencies are unlikely to estimate the desired
distribution well.

In general, the empirical frequencies estimate the probabilities of popular
names well, but are rather inaccurate for rare names. Is there a sample size
which assures us that all the names (or most of them) will appear enough times
to allow accurate probability estimation? The distribution of first names can
be conjectured to follow Zipf’s law. In such distributions, there will be a
significant fraction of rare items, as well as a considerable number of non-appearing
items, in any sample of reasonable size. The same holds for the language
unigrams model, which tries to estimate the distribution of single words. As has
been observed empirically on many occasions ([2], [5]), there are always many
rare words and a considerable number of unseen words, regardless of the sample
size. Given this observation, a fundamental issue is to estimate the distribution
in the best way possible.



1.1 Good-Turing Estimators

An important quantity, given a sample, is the probability mass of unseen words
(also called “the missing mass”). Several methods exist for smoothing the
probability and assigning probability mass to unseen items. The almost standard
method for estimating the missing probability mass is the Good-Turing estimator.
It estimates the missing mass as the number of items appearing exactly once in
the sample, divided by the sample size. In the names example above, the
Good-Turing missing mass estimator equals 0.6, meaning that the list of the class
names does not reflect the true distribution, to put it mildly. The Good-Turing
estimator can be extended to higher orders, that is, to estimating the probability
of all names appearing exactly $k$ times. Such estimators can also be used for
estimating the probability of individual words.
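
The numbers quoted for the names example can be reproduced with a few lines of Python. This is a minimal sketch for illustration only; the variable and function names are invented here, and "Henry" is used as the singular of the two Henries in the list.

```python
from collections import Counter

# The graduate seminar list from the example (10 names in total).
names = ["Alice", "Bob", "Charlie", "Dan", "Eve", "Frank",
         "George", "George", "Henry", "Henry"]
counts = Counter(names)
m = len(names)


def empirical_frequency(name: str) -> float:
    """Empirical frequency estimate: occurrences divided by sample size."""
    return counts[name] / m           # Counter returns 0 for unseen names


def good_turing_missing_mass() -> float:
    """Number of names seen exactly once, divided by the sample size."""
    seen_once = sum(1 for c in counts.values() if c == 1)
    return seen_once / m


print(empirical_frequency("Alice"), empirical_frequency("George"),
      empirical_frequency("Michael"), good_turing_missing_mass())
```

Six of the ten entries are names that occur exactly once, which is where the missing mass estimate of 0.6 comes from.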

The Good-Turing estimators date to World War II, and were published in
1953 ([10], [11]). They have been extensively used in language modeling
applications since then ([2], [3], [4], [15]). However, their theoretical convergence rate
in various models has been studied only in recent years ([17], [18], [19], [20],
[22]). For estimation of the probability of all words appearing exactly $k$ times in
a sample of size $m$, [19] shows a PAC bound on the Good-Turing estimation error
of approximately $O(k/\sqrt{m})$.

One of our main results improves the dependency on $k$ of this bound to
approximately $O(\sqrt[4]{k}/\sqrt{m} + k/m)$. We also show that the empirical frequencies
estimator has an error of approximately $O(1/k + \sqrt{k}/m)$, for large values of $k$. Based
on the two estimators, we derive a combined estimator with an error of
approximately $O(m^{-2/5})$, for any $k$. We also derive a weak lower bound of
$\Omega(\sqrt[4]{k}/\sqrt{m})$ for an error of any estimator based on an independent sample.

Our results give theoretical justification for using the Good-Turing estimator
for small values of $k$, and the empirical frequencies estimator for large values
of $k$. Though in most applications the Good-Turing estimator is used for very
small values of $k$ (e.g. $k \le 5$, as in [15] or [2]), we show that it is fairly accurate
in a much wider range.

1.2 Logarithmic Loss 

The Good-Turing estimators are used to approximate the probability mass of
all the words with a certain frequency. For many applications, estimating this
probability mass is not the main optimization criterion. Instead, a certain distance
measure between the true and the estimated distributions needs to be minimized.

The most popular distance measure widely used in NLP applications is the
Kullback-Leibler (KL) divergence. For $P = \{p_x\}$ and $Q = \{q_x\}$, two distributions
over some set $X$, this measure is defined as $\sum_x p_x \ln \frac{p_x}{q_x}$. An equivalent
measure, up to the entropy of $P$, is the logarithmic loss (log-loss), which equals
$\sum_x p_x \ln \frac{1}{q_x}$.
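
The relation between the two measures, log-loss = KL divergence + entropy of $P$, can be checked numerically. The sketch below uses the standard definitions $\sum_x p_x \ln(p_x/q_x)$ and $\sum_x p_x \ln(1/q_x)$; the two example distributions are made up for illustration and $q_x > 0$ is assumed wherever $p_x > 0$.

```python
import math
from typing import Dict

Dist = Dict[str, float]


def kl_divergence(p: Dist, q: Dist) -> float:
    """KL(P || Q) = sum_x p_x ln(p_x / q_x), skipping terms with p_x = 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


def log_loss(p: Dist, q: Dist) -> float:
    """Expected log-loss of Q under P: sum_x p_x ln(1 / q_x)."""
    return sum(px * math.log(1.0 / q[x]) for x, px in p.items() if px > 0)


def entropy(p: Dist) -> float:
    """Entropy of P in nats: sum_x p_x ln(1 / p_x)."""
    return sum(px * math.log(1.0 / px) for px in p.values() if px > 0)


p = {"a": 0.5, "b": 0.5}              # hypothetical true distribution
q = {"a": 0.25, "b": 0.75}            # hypothetical estimate
print(log_loss(p, q), kl_divergence(p, q) + entropy(p))
```

Since the entropy term does not depend on $Q$, minimizing the log-loss over estimates $Q$ is the same as minimizing the KL divergence.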

Many NLP applications use the value of log-loss to evaluate the quality of 
the estimated distribution. However, the log-loss cannot be directly calculated, 
since it depends on the underlying distribution, which is unknown. Therefore, 
estimating log-loss using the sample is important, although the sample cannot 
be independently used for both estimating the distribution and testing it. The 
hold-out estimation splits the sample into two parts: training and testing. The 
training part is used for learning the distribution, whereas the testing sample 
is used for evaluating the average per-word log-loss. The main disadvantage of 
this method is the fact that it uses only part of the available information for 
learning, whereas in practice one would like to use all the sample. 

A widely used general estimation method is called leave-one-out. Basically,
it means averaging all the possible estimations, where a single item is chosen for
testing, and the rest is used for training. This procedure has the advantage of
using the entire sample; in addition, it is rather simple and can usually be easily
implemented. The existing theoretical analysis of the leave-one-out method ([14],
[16]) shows general PAC-style concentration bounds for the generalization error.
However, these techniques are not applicable in our setting.
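
As an illustration of the leave-one-out idea for a unigrams model, the following sketch averages the per-word log-loss over all ways of holding out a single sample item. The underlying estimator is an assumption made for this example only: additive (add-`alpha`) smoothing over a vocabulary of known size, which is not necessarily the estimator analyzed in this paper. The function name and parameters are likewise invented.

```python
import math
from collections import Counter
from typing import Sequence


def leave_one_out_log_loss(sample: Sequence[str],
                           vocab_size: int,
                           alpha: float = 1.0) -> float:
    """Average of -ln q(w), where q is trained on the sample without w.

    q is an add-alpha smoothed unigram estimate (an assumption of this
    sketch): q(w) = (count(w) + alpha) / (size + alpha * vocab_size).
    """
    m = len(sample)
    counts = Counter(sample)
    total = 0.0
    for w in sample:
        # Counts and sample size after removing this one occurrence of w.
        q_w = (counts[w] - 1 + alpha) / (m - 1 + alpha * vocab_size)
        total += -math.log(q_w)
    return total / m


print(leave_one_out_log_loss(["a", "a", "b"], vocab_size=2))
```

Because removing one occurrence only changes a single count, the whole estimate is computed in one pass over the sample, which is the sense in which the method is simple to implement.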

We show that the leave-one-out estimation error for the log-loss is
approximately $O(1/\sqrt{m})$ for any underlying distribution. In addition, we show a PAC
bound for the log-loss, as a function of various parameters of the distribution.

1.3 Model and Semantics 

We denote the set of all words as $V$, and $N = |V|$. Let $P$ be a distribution
over $V$, where $p_w$ is the probability of a word $w \in V$. Given a sample $S$ of size
$m$, drawn i.i.d. using $P$, we denote the number of appearances of a word $w$ in
$S$ as $c_w^S$, or simply $c_w$, when a sample $S$ is clear from the context$^1$. We define
$S_k = \{w \in V : c_w = k\}$, and $n_k = |S_k|$.

For a claim $\Phi$ regarding a sample $S$, we write $\forall^{\delta} S\ \Phi[S]$ for $P(\Phi[S]) \ge 1 - \delta$.
For some PAC bound function $f(\cdot)$, we write $\tilde{O}(f(\cdot))$ for $O\left(f(\cdot) \left(\ln \frac{m}{\delta}\right)^c\right)$, where
$c > 0$ is some constant, and $\delta$ is the PAC error probability.

Due to lack of space, some of the proofs are omitted. Detailed proofs can be
found in [7].

2 Concentration Inequalities 

In this section we state several standard Chernoff-style concentration
inequalities. We also show some of their corollaries regarding the maximum-likelihood
approximation of $p_w$ by $\hat{p}_w = \frac{c_w}{m}$.

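Lemma 1. (Hoeffding’s inequality: [13], [18]) Let $Y = Y_1, \ldots, Y_n$ be a set of $n$
independent random variables, such that $Y_i \in [b_i, b_i + d_i]$. Then, for any $\epsilon > 0$,

$P\left(\left|\sum_i Y_i - E\left[\sum_i Y_i\right]\right| > \epsilon\right) \le 2 \exp\left(-\frac{2\epsilon^2}{\sum_i d_i^2}\right)$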

This inequality has an extension for various functions of $\{Y_1, \ldots, Y_n\}$, which
are not necessarily the sum.

Lemma 2. (Variant of McDiarmid’s inequality: [21], [6]) Let $Y = Y_1, \ldots, Y_n$
be a set of $n$ independent random variables, and $f(Y)$ such that any change of
$Y_i$ value changes $f(Y)$ by at most $d_i$. Let $d = \max_i d_i$. Then,

$\forall^{\delta} Y: \ |f(Y) - E[f(Y)]| \le d \sqrt{\frac{n \ln \frac{2}{\delta}}{2}}$




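Lemma 3. (Angluin-Valiant bound: [1], [18]) Let $Y = Y_1, \ldots, Y_n$ be a set of $n$
independent random variables, where $Y_i \in [0, B]$. Let $\mu = E\left[\sum_i Y_i\right]$. Then, for
any $\epsilon > 0$,

$P\left(\left|\sum_i Y_i - \mu\right| > \epsilon\right) \le 2 \exp\left(-\frac{\epsilon^2}{(2\mu + \epsilon) B}\right)$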



The next lemma shows an explicit upper bound on the binomial distribution
probability$^2$.

$^1$ Unless mentioned otherwise, all further sample-dependent definitions depend on the
sample $S$.

$^2$ Its proof is based on Stirling approximation directly, though local limit theorems
could be used. This form of bound is needed for the proof of Theorem 4.

Lemma 4. Let $X \sim Bin(n, p)$ be a binomial random variable, i.e. a sum of
$n$ i.i.d. Bernoulli random variables with $p \in (0, 1)$. Let $\mu = E[X] = np$. For
$x \in (0, n]$, there exist some $T_x = \exp\left(\frac{1}{12x} + O\left(\frac{1}{x^2}\right)\right)$, such that $\forall k \in \{0, \ldots, n\}$,
we have $P(X = k) \le \frac{1}{\sqrt{2\pi\mu(1-p)}} \cdot \frac{T_n}{T_\mu T_{n-\mu}}$. For integral values of $\mu$, the equality is
achieved at $k = \mu$. (Note that for $x \ge 1$, we have $T_x = \Theta(1)$.)

The next lemma (by Hoeffding, [12]) deals with the number of successes in
independent trials.

Lemma 5. ([12], Theorem 5) Let $Y_1, \ldots, Y_n \in \{0, 1\}$ be a sequence of independent
trials, with $p_i = E[Y_i]$. Let $X = \sum_i Y_i$ be the number of successes, and
$p = \frac{1}{n} \sum_i p_i$ be the average trial success probability. For any integers $b$ and $c$
such that $0 \le b \le np \le c \le n$, we have:

$\sum_{k=b}^{c} \binom{n}{k} p^k (1-p)^{n-k} \le P(b \le X \le c) \le 1$

Using the above lemma, the next lemma shows a general concentration bound
for a sum of arbitrary real-valued functions of the components of a multinomial
distribution. We show that with a small penalty, any Chernoff-style bound pretending
the components being independent is valid$^3$. We recall that $c_w^S$, or equivalently
$c_w$, is the number of appearances of the word $w$ in a sample $S$ of size $m$.

Lemma 6. Let $\{c'_w \sim Bin(m, p_w) \mid w \in V\}$ be independent binomial random
variables. Let $\{f_w(x) : w \in V\}$ be a set of real valued functions. Let
$F = \sum_w f_w(c_w)$ and $F' = \sum_w f_w(c'_w)$. For any $\epsilon > 0$,

$P(|F - E[F]| > \epsilon) \le 3\sqrt{m}\, P(|F' - E[F']| > \epsilon)$

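The following lemmas provide concentration bounds for maximum-likelihood
estimation of $p_w$ by $\hat{p}_w = \frac{c_w}{m}$.

Lemma 7. Let $\delta > 0$, and $\lambda > 3$. We have $\forall^{\delta} S$:

$\forall w \in V$, s.t. $m p_w \ge 3 \ln \frac{2m}{\delta}$: $\ |m p_w - c_w| \le \sqrt{3 m p_w \ln \frac{2m}{\delta}}$;

$\forall w \in V$, s.t. $m p_w > \lambda \ln \frac{2m}{\delta}$: $\ c_w > \left(1 - \sqrt{\frac{3}{\lambda}}\right) m p_w$.

Lemma 8. Let $\delta \in (0, 1)$, and $m > 1$. Then, $\forall^{\delta} S$: $\forall w \in V$ such that
$m p_w \le 3 \ln \frac{m}{\delta}$, we have $c_w \le 6 \ln \frac{m}{\delta}$.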

$^3$ The negative association analysis ([8]) shows that a sum of negatively associated
variables must obey Chernoff-style bounds pretending that the variables are
independent. The components of a multinomial distribution are negatively associated.
Therefore, any Chernoff-style bound is valid for their sum, as well as for the sum
of monotone functions of the components. In some sense, our result extends this
notion, since it does not require the functions to be monotone.







3 Hitting Mass Estimation 



In this section our goal is to estimate the probability of the set of words appearing
exactly $k$ times in the sample, which we call “the hitting mass”. We analyze the
Good-Turing estimator, the empirical frequencies estimator, and the combined
estimator.



Definition 1. We define the hitting mass and its estimators as:

\[ M_k = \sum_{w \in S_k} p_w, \qquad G_k = \frac{k+1}{m-k}\, n_{k+1}, \qquad \hat{M}_k = \frac{k}{m}\, n_k \]
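The three quantities in Definition 1 can be computed directly in a simulation where the true word probabilities are known. The following is a minimal Python sketch, assuming $n_k = |S_k|$, $G_k = \frac{k+1}{m-k} n_{k+1}$ and $\hat{M}_k = \frac{k}{m} n_k$. The function name `hitting_mass_estimates` and its arguments are illustrative and are not taken from the paper.

```python
from collections import Counter


def hitting_mass_estimates(sample, true_probs, k):
    """Return (M_k, G_k, M_hat_k) for one sample.

    sample     : list of observed words (its length is m)
    true_probs : dict word -> p_w; only available in a simulation
    k          : count level, with 0 <= k < m
    """
    m = len(sample)
    counts = Counter(sample)
    # S_k: vocabulary words that appear exactly k times in the sample.
    s_k = [w for w in true_probs if counts.get(w, 0) == k]
    n_k = len(s_k)
    n_k_plus_1 = sum(1 for w in true_probs if counts.get(w, 0) == k + 1)

    hitting_mass = sum(true_probs[w] for w in s_k)   # M_k (uses the true p_w)
    good_turing = (k + 1) / (m - k) * n_k_plus_1     # G_k
    empirical = k / m * n_k                          # \hat{M}_k
    return hitting_mass, good_turing, empirical
```

Only the first returned value needs the true distribution; the two estimators depend on the sample alone, which is the point of the comparison in this section.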






The outline of this section is as follows. Definition 3 slightly redefines the 
hitting mass and its estimators. Lemma 9 shows that this redefinition has a neg- 
ligible influence. Then, we analyze the estimation errors using the concentration 
inequalities from Section 2. 

The expectation of the Good-Turing estimator error is bounded as in [19]. 
Lemma 14 bounds the deviation of the error, using the negative association 
analysis. A tighter bound, based on Lemma 6, is achieved at Theorem 1. Theorem 
2 analyzes the error of the empirical frequencies estimator. Theorem 3 refers to 
the combined estimator. Finally, Theorem 4 shows a weak lower bound for the 
hitting mass estimation. 

Definition 2. For any $w \in V$ and $i \in \{0, \ldots, m\}$, we define $X_{w,i}$ as a random
variable equal to 1 if $c_w = i$, and 0 otherwise.



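Definition 3. Let $\alpha > 0$ and $k > 3\alpha^2$. We define $I_{k,\alpha} = \left[\frac{k - \alpha\sqrt{k}}{m}, \frac{k+1+\alpha\sqrt{k+1}}{m}\right]$, and $V_{k,\alpha} = \{w \in V : p_w \in I_{k,\alpha}\}$. We define:

\[ M_{k,\alpha} = \sum_{w \in S_k \cap V_{k,\alpha}} p_w = \sum_{w \in V_{k,\alpha}} p_w X_{w,k} \]

\[ G_{k,\alpha} = \frac{k+1}{m-k}\, |S_{k+1} \cap V_{k,\alpha}| = \frac{k+1}{m-k} \sum_{w \in V_{k,\alpha}} X_{w,k+1} \]

\[ \hat{M}_{k,\alpha} = \frac{k}{m}\, |S_k \cap V_{k,\alpha}| = \frac{k}{m} \sum_{w \in V_{k,\alpha}} X_{w,k} \]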



By Lemma 7 and Lemma 8, for large values of k the redefinition coincides 
with the original definition with high probability: 

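Lemma 9. For $\delta > 0$, let $\alpha = \sqrt{6\ln\frac{4m}{\delta}}$. For $k > 18\ln\frac{4m}{\delta}$, we have $\forall^{\delta} S$:
$M_k = M_{k,\alpha}$, $G_k = G_{k,\alpha}$, and $\hat{M}_k = \hat{M}_{k,\alpha}$.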

^ The Good-Turing estimator is usually defined as $\frac{k+1}{m}\, n_{k+1}$. The two definitions are
almost identical for small values of $k$. Following [19], we use our definition, which
makes the calculations slightly simpler.

Since the minimal probability of a word in $V_{k,\alpha}$ is $\Omega\left(\frac{k}{m}\right)$, we derive:

Lemma 10. Let $\alpha > 0$ and $k > 3\alpha^2$. Then, $|V_{k,\alpha}| = O\left(\frac{m}{k}\right)$.

Using Lemma 4, we derive:

Lemma 11. Let $\alpha > 0$ and $3\alpha^2 < k \le \frac{m}{2}$. Let $w \in V_{k,\alpha}$. Then, $E[X_{w,k}] = P(c_w = k) = O\left(\frac{1}{\sqrt{k}}\right)$.



3.1 Good-Turing Estimator

The following lemma, based on the definition of the binomial distribution, was 
shown in Theorem 1 of [19]. 

Lemma 12. For any $k < m$, and $w \in V$, we have:

\[ p_w P(c_w = k) = \frac{k+1}{m-k}\, P(c_w = k+1)(1 - p_w) \]
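The identity of Lemma 12 follows from the ratio of consecutive binomial probabilities, and it can be checked numerically. The short Python sketch below assumes $c_w \sim \mathrm{Bin}(m, p_w)$; the helper names `binom_pmf` and `lemma12_sides` are illustrative.

```python
from math import comb


def binom_pmf(m, p, k):
    """P(c = k) for c ~ Bin(m, p)."""
    return comb(m, k) * p**k * (1 - p) ** (m - k)


def lemma12_sides(m, p, k):
    """Return both sides of p * P(c = k) = (k+1)/(m-k) * P(c = k+1) * (1 - p)."""
    lhs = p * binom_pmf(m, p, k)
    rhs = (k + 1) / (m - k) * binom_pmf(m, p, k + 1) * (1 - p)
    return lhs, rhs
```

Summing the identity over all words is what relates the expectation of the hitting mass to the expectation of the Good-Turing estimator.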

The following lemma bounds the expectations of the redefined hitting mass, 
its Good-Turing estimator, and their difference. 

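Lemma 13. Let $\alpha > 0$ and $3\alpha^2 < k \le \frac{m}{2}$. We have $E[M_{k,\alpha}] = O\left(\frac{1}{\sqrt{k}}\right)$,
$E[G_{k,\alpha}] = O\left(\frac{1}{\sqrt{k}}\right)$, and $|E[G_{k,\alpha}] - E[M_{k,\alpha}]| = O\left(\frac{\sqrt{k}}{m}\right)$.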

Using the negative association notion, we can show a preliminary bound for 
Good-Turing estimation error: 

Lemma 14. For 6 > 0 and 18 In ^ < fc < ^, we have 

\Gk - Mk\ = O ( \l 



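Lemma 15. Let $\delta > 0$, $k > 0$. Let $U \subseteq V$. Let $\{b_w : w \in U\}$ be a set of weights,
such that $b_w \in [0, B]$. Let $X_k = \sum_{w \in U} b_w X_{w,k}$, and $\mu = E[X_k]$. We have:

\[ \forall^{\delta} S, \quad |X_k - \mu| \le \max\left\{ \sqrt{4B\mu\ln\frac{6\sqrt{m}}{\delta}},\; 2B\ln\frac{6\sqrt{m}}{\delta} \right\} \]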

Proof. By Lemma 6, combined with Lemma 3, we have: 

P(|X, - Ml > t) < 6^ exp 

< maxjeVm exp ,6 \/to exp , (1) 

where (1) follows by considering $\epsilon \le 2\mu$ and $\epsilon > 2\mu$ separately. The lemma
follows by substituting $\epsilon = \max\left\{ \sqrt{4B\mu\ln\frac{6\sqrt{m}}{\delta}},\; 2B\ln\frac{6\sqrt{m}}{\delta} \right\}$. □

We now derive the concentration bound on the error of the Good-Turing
estimator. 

Theorem 1. For (5 > 0 and 181n ^ < k < we have V^S: 



\Gk-Mk\ = 0 




m 



Proof. Let a = y 6 In Using Lemma 9, we have V 2 5’: = Gk,aj and Mk = 

Recall that ^ Pw^w.k and a m—k ■ 

Both Mfc CK and G^^ql are linear combinations of and respectively, 

where the coefficients’ magnitude is O (^), and the expectation, by Lemma 13, 
1 

y/k 



^ . By Lemma 15, we have: 




= (2) 

- E[GkA\ = O (3) 

Combining (2), (3), and Lemma 13, we have S\ 

\Gk — Mk\ = \Gk,a — ^k,a\ 



< |Gfe,„ - E[Gk,a\\ + - E[Mk,a\\+ \E[Gk,a\ ~ E[Mk,a\\ 




which completes the proof. □ 

3.2 Empirical Frequencies Estimator 

In this section we bound the error of the empirical frequencies estimator $\hat{M}_k$.
Theorem 2. For (5 > 0 and 181n ^ < k < we have: 



y^s, \Mk 



Mk\ = 0 



m 




Proof. Let a = By Lemma 9, we have V^S”: Mk = Mk,a, and Mk = 

Mk,a- Let = {w e Vk,a -Pw < E}, and = {ic G 14, „ ■ Pw > E}. Let 
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and let X? specify either X_ or X_|_. By the definition, for w € Vk^a we 
have \^-Pw\ = O By Lemma 10, |Vfc,a| = O (f). By Lemma 11, for 

u> G Vk,a we have E[X^^k] = O • Therefore, 



■weVk,a \ / 

Both X- and are linear combinations of Xw,k, where the coefficients are 
O and the expectation is O [f). Therefore, by Lemma 15, we have: 

Vl5: (5) 

By the definition of X- and X+, Mk^a — Mk^a = — X-. Combining (4) 

and (5), we have V'^S': 



\Mk -Mk\ = \Mk,o. - Mk,^\ = |^+ - ^-1 



< |X+ - E[X+]\ + E[X+] + \X_ - E[X_]\ + E[X_] 




since -\/o6 = 0(o + b), and we use a = and b = f. □ 



3.3 Combined Estimator 

In this section we combine the Good-Turing estimator with the empirical fre- 
quencies to derive a combined estimator, which is accurate for all values of k. 



Definition 4. We define $\tilde{M}_k$, a combined estimator for $M_k$, by:

\[ \tilde{M}_k = \begin{cases} G_k & k \le m^{2/5} \\ \hat{M}_k & k > m^{2/5} \end{cases} \]

Lemma 16. (Theorem 3 in [19]) Let $k \in \{0, \ldots, m\}$. For any $\delta > 0$, we have:




The following theorem shows that Mk has an error bounded by O (m s j , for 
any k. For small k, we use Lemma 16. Theorem 1 is used for 18 In ^ < k < mi . 
Theorem 2 is used for < fc < The complete proof also handles k > y- 

Theorem 3. Let <5 > 0. For any fc G {0, . . . , m}, we have: 

The following theorem shows a weak lower bound for approximating $M_k$. It
applies to estimating $M_k$ based on a different independent sample. This is a very
"weak" notion, since $G_k$, as well as $\hat{M}_k$, are based on the same sample as $M_k$.

Theorem 4. Suppose that the vocabulary consists of f] words distributed uni- 
formly (i.e. Pw = ^), where 1 <C A: <C m. The variance of Mk is O ■ 

4 Leave-One-Out Estimation of Log-Loss 

Many NLP applications use log-loss as the learning performance criterion. Since
the log-loss depends on the underlying probability $P$, its value cannot be
explicitly calculated, and must be approximated. The main result of this section,
Theorem 5, is an upper bound on the leave-one-out estimation of the log-loss,
assuming a general family of learning algorithms.

Given a sample $S = \{s_1, \ldots, s_m\}$, the goal of a learning algorithm is to
approximate the true probability $P$ by some probability $Q$. We denote the
probability assigned by the learning algorithm to a word $w$ by $q_w$.

Definition 5. We assume that any two words with equal sample frequency are
assigned equal probabilities in $Q$, and therefore denote $q_w$ by $q(c_w)$. Let the
log-loss of a distribution $Q$ be:

\[ L = \sum_{w \in V} p_w \ln\frac{1}{q_w} = \sum_{k \ge 0} M_k \ln\frac{1}{q(k)} \]



Let the leave-one-out estimation, $q'_w$, be the probability assigned to $w$ when
one of its instances is removed. We assume that any two words with equal sample
frequency are assigned equal leave-one-out probability estimation, and therefore
denote $q'_w$ by $q'(c_w)$. We define the leave-one-out estimation of the log-loss as:

\[ L_{\text{leave-one}} = \sum_{w \in V} \frac{c_w}{m} \ln\frac{1}{q'_w} = \sum_{k > 0} \frac{k\, n_k}{m} \ln\frac{1}{q'(k)} \]

Let $L_w = L(c_w) = \ln\frac{1}{q(c_w)}$ and $L'_w = L'(c_w) = \ln\frac{1}{q'(c_w)}$. Let $L_{\max} = \max_k \max\{L(k), L'(k+1)\}$.



In this section we discuss a family of learning algorithms that receive the
sample as an input. Assuming an accuracy parameter $\delta$, we require the following
properties to hold:

1. Starting from a certain number of appearances, the estimation is close to
the sample frequency. Specifically, for some $\alpha, \beta \in [0, 1]$,

\[ \forall k \ge \ln\frac{4m}{\delta}, \quad q(k) = \frac{k - \alpha}{m - \beta} \qquad (6) \]



2. The algorithm is stable when a word is extracted from the sample: 



Vm, 2</c<101n^, 

Vm, yS s.t. nf > 0, k G {0, 1}, 



\L'{k+l)-L{k)\ = o(^^^ 
\L'{k+l)-L{k)\ = o(^^'^ 



( 7 ) 

( 8 ) 



An example of such an algorithm is the following leave-one-out algorithm
(we assume that the vocabulary is large enough so that $n_0 + n_1 > 0$):

\[ q_w = \begin{cases} \dfrac{N - n_0 - 1}{(n_0 + n_1)(m - 1)} & c_w \le 1 \\[2ex] \dfrac{c_w - 1}{m - 1} & c_w \ge 2 \end{cases} \]
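A short Python sketch of this example algorithm is given below. It rests on two assumptions that are our reading of the display rather than statements from the paper: $N$ is the vocabulary size, and the two cases are $c_w \le 1$ and $c_w \ge 2$ (the split under which the probabilities sum to one). The function name `leave_one_out_probs` is illustrative.

```python
from collections import Counter


def leave_one_out_probs(sample, vocabulary):
    """Assign q_w to every vocabulary word (assumed case split: c_w <= 1 / c_w >= 2).

    sample     : list of observed words, all drawn from `vocabulary`
    vocabulary : list of all words (assumed: N = len(vocabulary))
    """
    m = len(sample)
    big_n = len(vocabulary)
    counts = Counter(sample)
    n0 = sum(1 for w in vocabulary if counts.get(w, 0) == 0)
    n1 = sum(1 for w in vocabulary if counts.get(w, 0) == 1)
    if m < 2 or n0 + n1 == 0:
        raise ValueError("need m >= 2 and n0 + n1 > 0")

    # Mass shared equally by unseen words and words seen exactly once.
    low = (big_n - n0 - 1) / ((n0 + n1) * (m - 1))
    return {
        w: low if counts.get(w, 0) <= 1 else (counts[w] - 1) / (m - 1)
        for w in vocabulary
    }
```

Under these assumptions the words with $c_w \ge 2$ receive total mass $(m - N + n_0)/(m-1)$, and the remaining $(N - n_0 - 1)/(m-1)$ is spread over the $n_0 + n_1$ rare words.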






The next lemma shows that the expectation of the leave-one-out method is 
a good approximation for the per-word expectation of the logarithmic loss. 



Lemma 17. Let 0 < a < and y > 1. Let Bn ~ Bin{njp) be a binomial 
random variable. Let fy{x) = ln(max(x, y)). Then, 



{)<E 



Pfy{Bn 



a) 



n 




n 







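Sketch of Proof. For a real-valued function $F$ (here $F(x) = f_y(x - \alpha)$), we have:

\[ E\left[\frac{B_n}{n} F(B_n - 1)\right] = \sum_{x=1}^{n} \binom{n}{x} p^x (1-p)^{n-x}\, \frac{x}{n}\, F(x-1) = p\,E[F(B_{n-1})] \]

where we used $\binom{n}{x}\frac{x}{n} = \binom{n-1}{x-1}$. The rest of the proof follows by algebraic
manipulations, and the definition of the binomial distribution (see [7] for details). □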
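The key step of the sketch, $E\left[\frac{B_n}{n} F(B_n - 1)\right] = p\,E[F(B_{n-1})]$, can be verified numerically for any concrete $F$. The Python sketch below evaluates both sides exactly from the binomial probabilities; the helper names are illustrative, and the choice $F(x) = \ln(\max(x - a, y))$ in the usage is only an example of the form used in Lemma 17.

```python
from math import comb, log


def binom_pmf(n, p, x):
    """P(B = x) for B ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def leave_one_out_side(n, p, F):
    """E[(B_n / n) * F(B_n - 1)]; the x = 0 term vanishes."""
    return sum(binom_pmf(n, p, x) * (x / n) * F(x - 1) for x in range(1, n + 1))


def shifted_side(n, p, F):
    """p * E[F(B_{n-1})]."""
    return p * sum(binom_pmf(n - 1, p, x) * F(x) for x in range(n))


# Example F of the form used in Lemma 17, with assumed values a = 0.5, y = 1.
example_F = lambda x: log(max(x - 0.5, 1.0))
```

The two functions agree up to floating-point error because $\binom{n}{x}\frac{x}{n} = \binom{n-1}{x-1}$ turns the first sum into the second term by term.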



Lemma 18. Let 5 > 0. We have \/^S: ri 2 = O 



((^mlni + ni)lnf). 



Theorem 5. For S > 0, we have: 



\L - Li 



eave—one 



= O Lrr,.n.A 



Proof. Let t/u, = ^1 — ~ 2- By Lemma 7, with A = 5, we have S: 



\/w G V : puj > 
Ww G V : pw > 



31n^ 

m 

51n^ 



\Pw ^ < 



3 p^ In ^ 



Vh = { 



Let Vh = \w G V : pw > 



51n 



I and Vl = V \ Vh. We have: 



(9) 



,c„>2/^ + 2>(5-yi5)ln^>ln^ (10) 



|L-L, 



eave—one 



< 



w^Vh 



E (r-i. - f-f.) + E (»-i" - 



w^Vl 



( 11 ) 



We start by bounding the first term of (11). By (10), we have Vw G 
Vh,Cw > + 2 > \n^. Assumption (6) implies that therefore 



— Iv, IBizA — 



L,„ = In 



= In 



m—fS 

max ( Cti; — a , 1 /it, ) 



/ 1 m—1 — 0 



and L[„ = In 



Cuj 1 Oc 



= In 



m—l — 0 
max(cit, — l — (x,y.w 



-. Let 



H m- P 

Err^ = — In ^ 

m max(Cu, — 1 — a,yw) 



m — P 

Pwin r 

max(cu, - a, Utu) 
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We have: 



X! PwLJ^ 



w^Vh 



< 



Err. 

wGVh 

Err. 

wGVh 



H , m-l- !3 ^ 

^ m 



m — (3 



w^Vh 



H 



o{ - 

m 



(12) 



We bound |X),ueVK using McDiarmid’s inequality. As in Lemma 17, 

let fw{x) = ln(max(a;, j/u,)). We have: 



E\Err^ =\Yi{m — (3)E 



L m 



■Pw 



E 



Pw fw(.^w ek) /u>(ou> 1 ek) 

m 



The first expectation equals 0, the second can be bounded using Lemma 17: 



E ^ 

w£Vh 



< y ^ = o(- 

m \m 

wGVh ^ 



(13) 



In order to use McDiarmid’s inequality, we bound the change of 
as a function of a single change in the sample. Suppose that a word u is replaced 
by a word v. This results in decrease for c„, and increase for c„. Recalling that 
Uw = f2{mpw), the change of Err^ , as well as the change of Err^ , is bounded 
by O 0^) (see [7] for details). 

By (12), (13), and Lemma 2, we have V^S”: 



wGVh 





(14) 



Next, we bound the second term of (11). By Lemma 8, we have V 



3 S': 



31n — 

Vw G V s.t. Pw < Cu, < 61n y 



(15) 



Let 6 = 51n By (9) and (15), for any w such that , we have: 



m 



< max <Pw 3- 



Sp^ln^ 61n^ (5 + V3*5)ln^ 2h 



< 



< 



m 



m 



m 



m 
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Therefore Ww GVl, we have < 2b. Let = |Vl nS'fc|, 
and VrnSfcP™- Using algebraic manipulations (see [7] for details), we 

have: 



X! PwLw^ 



w^Vl 
26-1 



kn 



26-1 






fc=i 



k—O 



26-1 26-1 y 

< ^ G^|L'(A: + 1) - m\ +Y.\Gk- Mt\m + O ( 

k =0 fc =0 ^ 



bLn 



m 



(16) 



The first sum of (16) is bounded using (7), (8), and Lemma 18 (with accuracy 
:^). The second sum of (16) is bounded using Lemma 16 separately for every 
k < 2b with accuracy Since the proof of Lemma 16 also holds for and 
MJ^ (instead of Gk and Mk), we have 'iis, for every k < 2b, \G^ — M^\ = 

O ■ Therefore (the details can be found at [7]), 




The proof follows by combining (11), (14), and (17). □ 



5 Log-Loss A Priori 

Section 4 bounds the error of the leave-one-out estimation of the log-loss. In this 
section we analyze the log-loss itself. We denote the learning error (equivalent 
to the log-loss) as the KL-divergence between the true and the estimated distri- 
bution. We refer to a general family of learning algorithms, and show an upper 
bound for the learning error. 

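Let $\alpha \in (0, 1)$ and $\tau \ge 1$. We define an (absolute discounting) algorithm $A_{\alpha,\tau}$,
which "removes" $\frac{\alpha}{m}$ probability mass from words appearing at most $\tau$ times, and
uniformly spreads it among the unseen words. We denote by $n_{1 \ldots \tau} = \sum_{i=1}^{\tau} n_i$
the number of words with count between 1 and $\tau$. The learned probability $Q$ is
defined by:

\[ q_w = \begin{cases} \dfrac{\alpha\, n_{1 \ldots \tau}}{m\, n_0} & c_w = 0 \\[2ex] \dfrac{c_w - \alpha}{m} & 1 \le c_w \le \tau \\[2ex] \dfrac{c_w}{m} & \tau < c_w \end{cases} \]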
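The absolute discounting rule can be sketched in a few lines of Python. The sketch assumes a finite, explicitly listed vocabulary with at least one unseen word ($n_0 > 0$); the function name `absolute_discounting` and its signature are illustrative, not the authors' code.

```python
from collections import Counter


def absolute_discounting(sample, vocabulary, alpha, tau):
    """Return q_w for every vocabulary word under the rule A_{alpha, tau}.

    Words seen between 1 and tau times each lose alpha/m of mass;
    the removed mass is spread uniformly over the unseen words.
    """
    m = len(sample)
    counts = Counter(sample)
    n0 = sum(1 for w in vocabulary if counts.get(w, 0) == 0)
    n_1_tau = sum(1 for w in vocabulary if 1 <= counts.get(w, 0) <= tau)
    if n0 == 0:
        raise ValueError("the rule needs at least one unseen word (n0 > 0)")

    q = {}
    for w in vocabulary:
        c = counts.get(w, 0)
        if c == 0:
            q[w] = alpha * n_1_tau / (m * n0)   # share of the removed mass
        elif c <= tau:
            q[w] = (c - alpha) / m              # discounted frequency
        else:
            q[w] = c / m                        # plain empirical frequency
    return q
```

The removed mass totals $\alpha\, n_{1 \ldots \tau}/m$, which is exactly what the $n_0$ unseen words receive, so the result is a probability distribution.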
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Theorem 6. For any <5 > 0 and A > 3, such that r < (A — let 

\ In 

X = — and Nx = |{i« £ V : Pw > a^}|- Then, the learning error of is 
hounded V'^S' by: 



o< V 



Since counts only words with p^u > x, it is bounded by Therefore, 
X = m~ 3 gives a bound of O (^Mq In fV + m~ J ^ . Lower loss can be achieved 

for specific distributions, such as those with small $M_0$ and small $N_x$ (for some
reasonable $x$).

Abstract. We define the problem of inferring a “mixture of Markov
chains” based on observing a stream of interleaved outputs from these
chains. We show a sharp characterization of the inference process. The
problem we consider also has applications such as gene finding, intrusion
detection, etc., and more generally in analyzing interleaved sequences.

1 Introduction 

In this paper we study the question of inferring Markov chains from a stream 
of interleaved behavior. We assume that the constituent Markov chains output 
their current state. The sequences of states thus obtained are interleaved by 
some switching mechanism (such as a natural mixture model). Observe that if 
we only observe a (probabilistic) function of the current state, the above prob- 
lem already captures hidden Markov models and probabilistic automata, and is 
computationally intractable as shown by Abe and Warmuth [1]. Our results can 
therefore be interpreted as providing an analytical inference mechanism for one 
class of hidden Markov models. The closely related problem of learning switching 
distributions is studied by Freund and Ron [10]. 

Thiesson et al. study learning mixtures of Bayesian networks and DAG models
[16,17]. In related work, learning mixtures of Gaussian distributions is
studied in [6,3]. The hidden Markov model, pioneered in speech recognition (see
[14,4]), has been the obvious choice for modeling sequential patterns. Related
hierarchical Markov models [11] were proposed for graphical modeling. Mixture
models have been studied considerably in the context of learning and even earlier
in the context of pattern recognition [8]. To the best of our knowledge, mixture
models of Markov chains have not been explored.

Our motivation for studying the problem is in understanding interleaved pro- 
cesses that can be modeled by discrete-time Markov chains. The interleaving 
process controls a token which it hands off to one of the component processes

** This work was supported by NSF CCR98-20885 and NSF CCR01-05337.



J. Shawe-Taylor and Y. Singer (Eds.): COLT 2004, LNAI 3120, pp. 186–199, 2004.
© Springer-Verlag Berlin Heidelberg 2004




Inferring Mixtures of Markov Chains 



187 



at each time step. A component process that receives the token makes a tran- 
sition, outputs its state, and returns the token. We consider several variants of 
the interleaving process. In the simplest, tokens are handed off to the component 
processes with fixed probabilities independent of history. A more general model 
is where these hand-off probabilities are dependent on the chain of the state that 
was generated last. The following are potential applications of our framework. 

— The problem of intrusion detection is the problem of observing a stream 
of packets and deciding if some improper use is being made of system 
resources.^ We can attempt to model the background (good) traffic and the
intrusive traffic as different Markov processes. We then model the overall
traffic as a random mixture of these two types of traffic. The problem of 
fraud detection arises in this context as well; see [7,18,12,9] for models on 
intrusion and fraud detection. 

— Given a genome sequence (a sequence from a chromosome) the problem 
is to locate the regions of this sequence (called exons) that collectively 
represent a gene. Again, precise defining characteristics are not known for 
exons and the regions in between them called introns. However, a number 
of papers have attempted to identify statistical differences between these 
two types of segments. Because the presence of a nucleotide at one position 
affects the distribution of nucleotides at neighboring positions one needs to 
model these distributions (at least) as first-order Markov chains rather than 
treating each position independently. In fact, fifth-order Markov chains and
Generalized Hidden Markov Models (GHMMs) are used by gene finding
programs such as GENSCAN [5].

— The problem of validation and mining of log-files of transactions arises in
e-commerce applications [2,15]. The user interacts with a server and the only
information available at the server end is a transcript of the interleaved
interactions of multiple users. Typically searches/queries/requests are made
in “sessions” by the same user; but there is no obvious way to determine if two
requests correspond to the same user or different ones. Complete information
is not always available (due to proxies or explicit privacy concerns) and is at
times unreliable. See [13] for a survey of issues in this area.

The common theme of the above problems is the analysis of a sequence that
arises from a process which is not completely known. Furthermore, the problem is
quite simple if exactly one process is involved. The complexity of these problems
arises from the interleaving of two or more processes due to probabilistic
linearization of parallel processes rather than due to adversarial intervention.



^ We do not have a precise definition of what constitutes such intrusion but we expect 
that experts “will know it when they see it.”

1.1 Our Model

Let $M^{(1)}, \ldots, M^{(k)}$ be Markov chains where Markov chain $M^{(l)}$ has state
space $V_l$ for $l = 1, 2, \ldots, k$. The inference algorithm has no a priori knowledge of
which states belong to which Markov chains. In fact, identifying the set of states
in each chain is the main challenge in the inference problem.

One might be tempted to “simplify” the picture by saying that the process 
generating the data is a single Markov chain on the cross-product state space. 
Note, however, that at each step we only observe one component of the state of 
this cross-product chain and hence with this view, we are faced with the problem 
of inferring a hidden Markov model. Our results can therefore be interpreted as 
providing an analytical inference mechanism for one class of hidden Markov 
models where the hiding function projects a state in a product space to an 
appropriate component. We consider two mixture models. 

— In the simpler mixture model, we assume that there are probability values
$\alpha_1, \ldots, \alpha_k$ summing to 1 such that at each time step, Markov chain $M^{(l)}$
is chosen with probability $\alpha_l$. The choices at different time steps are
assumed to be independent. Note that the number $k$ of Markov chains (and,
necessarily, the mixing probabilities) are not known in advance.

— A more sophisticated mixture model, for example, in the case of modeling 
exons and introns, would be to assume that at any step the current chain 
determines according to some probability distribution which Markov chain 
(including itself) will be chosen in the next step. We call this more
sophisticated model the chain-dependent mixture model.

We assume that all Markov chains considered are ergodic, which means that
there is a $k_0$ such that every entry in $M^k$ is non-zero for $k \ge k_0$. Informally,
this means that there is a non-zero probability of eventually getting from any
state $i$ to any state $j$ and that the chain is aperiodic. We also assume that the
cover time^ of each of the Markov chains is bounded by $\tau$, a polynomial in the
maximum number of states in any chain; these restrictions are necessary to
estimate the edge transition probabilities of any Markov chain in polynomial time.
Furthermore, since we cannot infer arbitrary real probabilities exactly based on 
polynomially many observations, we will assume that all probabilities involved 
in the problem are of the form p/q where all denominators are bounded by some 
bound Q. As long as we are allowed to observe a stream whose length is some 
suitable polynomial in Q, we will infer the Markov chains exactly with high 
probability. 



^ The cover time is the maximum over all vertices $u$ of the expected number of steps
required by a random walk that starts at $u$ and ends on visiting every vertex in the
graph. For a Markov chain $M$, if we are at vertex $v$ we choose the next vertex to be
$v'$ with probability $M_{vv'}$.

1.2 Our Results

We first consider the version of the inference problem where the Markov chains
have pairwise-disjoint state sets in the chain-dependent mixture model. In this
model, the interleaving process is itself a Markov chain whose cover time we
denote by $\tau_1$. We show the following result in Section 3.

Theorem 1. For Markov chains over disjoint state sets and the chain-dependent
mixture model, we can infer a model of the source that is observationally
equivalent to the original source, i.e., the inferred model generates
the exact same distribution as the target model. We make the assumption that
an, i.e., the probability of observing the next label from the same Markov process 
is non-zero. We require a stream of length 0{T^TfQ^), where Q is the upper 
bound on the denominator of any probability represented as a fraction, and t\,t 
are upper bounds on the cover times of the interleaving and constituent processes, 
respectively. 

We can easily show that our upper bound in Theorem 1 is a polynomial 
function of the minimum length required to estimate each of the probabilities. 
Next, we prove that it is necessary to restrict to disjoint-state-set Markov chains 
to achieve polynomial-time inference schemes. 

Theorem 2. Inferring a chain-dependent mixture of Markov chains is
computationally intractable. In particular, we show that the inference of two-state
probabilistic automata (with variable alphabet size) can be represented in this model.

The question about the inference of simple probabilistic mixture of Markov 
chains with overlapping state spaces arises naturally as a consequence of the 
above two theorems. Although we do not get as general a result as Theorem 1, 
we show the following in Section 4, providing evidence towards a positive result. 

Theorem 3. For two Markov chains on non-disjoint state sets, we can infer
the chains in the simple mixture model with a stream of length $O(\mathrm{poly}(n))$ where
$n$ is the total number of states in both chains, provided that there is a state $i_s$
that occurs in only one chain, say $M^{(1)}$, and satisfies the technical condition:

either $M^{(1)}_{i_s j} > S_1(j)$ or $M^{(1)}_{i_s j} = 0$ for all states $j$,

where $S_1$ is the stationary distribution of $M^{(1)}$.

To make sense of the technical condition above consider the special case where 
the Markov chain is a random walk in a graph. The condition above is satisfied 
if there is a state that occurs in only one graph that has a small degree. This 
condition sounds plausible in many applications.

2 Preliminaries and Notation

We identify the combined state space of the given Markov chains with the set
$[n] = \{1, \ldots, n\}$. Suppose $M^{(1)}, \ldots, M^{(k)}$ are finite-state ergodic Markov chains
in discrete time with state space $V_l \subseteq [n]$ corresponding to $M^{(l)}$. We consider
two possible cases: one where the state spaces of the individual Markov chains
are disjoint and the other where they are allowed to overlap. Suppose each
Markov chain outputs its current state after it makes a transition. The first and
the simpler mixture model that we consider generates streams with the alphabet
$[n]$ in the following manner. Let $\alpha_1, \ldots, \alpha_k$ be such that $\sum_l \alpha_l = 1$. Assume that
initial states are chosen for each of the Markov chains arbitrarily. The stream
is generated by interleaving the outputs of Markov chains $M^{(1)}, \ldots, M^{(k)}$. For
each stream element, an index $l$ is chosen according to the distribution defined
by the $\alpha_l$'s. Then, $M^{(l)}$ is allowed to make a transition from its previous state and
its output is appended to the stream. Define $S_l(i)$ to be the probability of $i$ in
the stationary distribution of $M^{(l)}$.
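The generation procedure of the simple mixture model can be sketched in Python as follows. The representation of a chain as a dictionary of transition rows, the choice of the first listed state as the arbitrary initial state, and the function name `interleaved_stream` are assumptions made for illustration only.

```python
import random


def interleaved_stream(chains, alphas, length, seed=0):
    """Generate a stream from the simple mixture model.

    chains : list of dicts, chains[l][state] = {next_state: probability}
    alphas : mixing probabilities, one per chain, summing to 1
    length : number of stream elements to produce
    """
    rng = random.Random(seed)
    # Arbitrary initial states: the first state listed for each chain.
    current = [next(iter(chain)) for chain in chains]
    stream = []
    for _ in range(length):
        # Choose which chain receives the token, independently of history.
        l = rng.choices(range(len(chains)), weights=alphas)[0]
        row = chains[l][current[l]]
        states = list(row)
        # The chosen chain makes one transition and outputs its new state.
        current[l] = rng.choices(states, weights=[row[s] for s in states])[0]
        stream.append(current[l])
    return stream
```

Chains that are not chosen at a step keep their state, so each chain's substream is an ordinary run of that chain.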

A more general mixture model we explore is where the probability distribution
for choosing the Markov chain that will make a transition next is dependent
on the chain of the last output state. For $i, j \in [n]$, we use $\alpha_{ij}$ to denote the
probability that the control is handed off to the Markov chain that $j$ belongs to when
the last output was $i$. Note that for states $i_1, i_2$ in the same chain, $\alpha_{i_1 j} = \alpha_{i_2 j}$
and $\alpha_{j i_1} = \alpha_{j i_2}$ for all states $j \in [n]$. Since we use this mixture model only for
Markov chains with disjoint state spaces, the $\alpha_{ij}$'s are well defined.

We will sometimes denote the interleaving process by $I$. Then we can denote
the entire interleaved Markov process by a tuple, $(M^{(1)}, \ldots, M^{(k)}; I)$.

Let $T_i$ denote the (relative) frequency of occurrence of the state $i$. Given a
pattern $(ij)$ let $T_{ij}$ be the frequency of $j$ occurring immediately after $i$. Likewise
define $T_{ijs}$ to be the frequency of the pattern $(ijs)$.
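The frequencies $T_i$, $T_{ij}$ and $T_{ijs}$ are relative counts of contiguous patterns of length one, two and three. A minimal Python sketch follows; it counts every window of the stream, which is a simplification, and the function name `pattern_frequencies` is illustrative.

```python
from collections import Counter


def pattern_frequencies(stream, length):
    """Relative frequency of each contiguous pattern of the given length.

    Returns a dict mapping a tuple of states, e.g. (i, j), to its frequency
    among all windows of that length in the stream.
    """
    total = len(stream) - length + 1
    if total <= 0:
        return {}
    counts = Counter(tuple(stream[t:t + length]) for t in range(total))
    return {pattern: c / total for pattern, c in counts.items()}
```

For example, `pattern_frequencies(stream, 2)` plays the role of the table of $T_{ij}$ values; patterns that never occur are simply absent and can be treated as having frequency zero.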

We define the problem of inferring mixtures of Markov chains as given a stream 
generated as described above, constructing the transition matrices for the un- 
derlying Markov chains as well as the mixing parameters. The problem reduces 
to identifying the partitioning of the state space — since given a partitioning we 
can project the data on each of the partitions and identify the transition prob- 
abilities. 

It is also clear that if two Markov chain mixtures produce each finite length 
stream with equal probability, then they are indistinguishable by our techniques. 
Consequently we need a notion of observational equivalence. 

Definition 1. Two interleaved processes $\mathcal{P} = (M^{(1)}, \ldots, M^{(k)}; I)$ and
$\mathcal{P}' = (M'^{(1)}, \ldots, M'^{(k')}; I')$ are observationally indistinguishable if there is an
assignment of initial state probabilities to each chain of $\mathcal{P}'$ for every assignment
of initial states to the chains in $\mathcal{P}$ such that for any finite sequence in $[n]^*$ the
probability of the sequence being produced by $\mathcal{P}$ is equal to the probability of the
sequence being produced by $\mathcal{P}'$.







Note that we have no hope of disambiguating between observationally equivalent
processes. We provide an example of such a pair of processes:

Example. Let process $\mathcal{P} = (M^{(1)}, M^{(2)}; I)$ where $M^{(1)}$ is the trivial single-state
Markov chain on state 1 and $M^{(2)}$ is the trivial single-state Markov chain on
state 2. Let $I$ be the process which chooses each chain with probability $\frac{1}{2}$ at
each step.

Let process $\mathcal{P}' = (M'; I')$ where $I'$ trivially always chooses $M'$ and $M'$
is a 2-state process which has probability $\frac{1}{2}$ for all transitions. $\mathcal{P}$ and $\mathcal{P}'$ are
observationally indistinguishable.

Definition 2. A Markov chain $M^{(l)}$ is defined to be reducible to one-step
mixing if for all $i, j \in V_l$ we have $M^{(l)}_{ij} = S_l(j)$, i.e., the next state distribution is
also the stationary distribution.



Proposition 1. If $M^{(l)}$ is reducible to one-step mixing, where $|V_l| = z$, the
interleaved process $\mathcal{P} = (M^{(1)}, \ldots, M^{(l)}, \ldots, M^{(k)}; I)$ is observationally
indistinguishable from $\mathcal{P}' = (M^{(1)}, \ldots, M^{(l-1)}, M^{(l)}_1, M^{(l)}_2, \ldots, M^{(l)}_z, M^{(l+1)}, \ldots, M^{(k)}; I')$
for some interleaving process $I'$, where $M^{(l)}_r$ indicates the Markov chain defined on
the single state $r \in V_l$.

The interleaving process $I'$ is defined as follows: If in $I$ the probability of
transition from some chain into $M^{(l)}$ in $\mathcal{P}$ is $\alpha$, in $I'$ the probability of transition
from the same chain to $M^{(l)}_j$ is $\alpha S_l(j)$. Transition probabilities from $M^{(l)}_j$ are
the same in $I'$ as the transition probabilities from $M^{(l)}$ in $I$.

Remark: Note that a one-step-mixing Markov chain is a zeroth-order Markov 
chain and a random walk on it is akin to drawing independent samples from 
a distribution. Nevertheless, we use this terminology to highlight the fact that 
such chains are a special pathological case for our algorithms. 

3 Markov Chains on Disjoint State Spaces 

In this section, we consider the problem of inferring mixtures of Markov chains 
when state spaces are pairwise disjoint. To begin with, we will assume the simpler 
mixture model. In Section 3.2, we show how our techniques extend to the chain- 
dependent mixture model. 



3.1 The Simple Mixture Model 

Our algorithm will have two stages. In the first stage, our algorithm will discover
the partition of the whole state space $[n]$ into sets $V_1, \ldots, V_m$ which are the state
spaces of the component Markov chains. Then, it is easy to infer the transition
probabilities between states by looking at the substream corresponding to states
in each $V_l$. Once we infer the partition of the states, the mixing parameters $\alpha_l$
can be estimated accurately from the fraction of states in $V_l$ within the stream.

The main idea behind our algorithm is that certain patterns of states occur
with different probabilities depending on whether the states in the pattern
come from the same chain or from different chains. We make this idea precise
and describe the algorithm in what follows.

Recall that $S_l$ is the stationary distribution vector for the Markov chain $M^{(l)}$
extended to $[n]$. It is well known that the probability that Markov chain $M^{(l)}$
visits a state $i$ tends to $S_l(i)$ as time goes to infinity. It follows that in our
mixture model, the probability that we see a state $i$ in our stream tends to

\[ S(i) = \alpha_l S_l(i) \]

where $l$ is such that $i \in V_l$. Note that $l$ is unique since the state spaces are
disjoint. Hence, one can get an estimate $T_i$ for $S(i)$ by observing the frequencies^
of each state $i$ in the stream. The accuracy of this estimate is characterized by
the following lemma.

Lemma 1. For all i, the estimate S{i) is within 6“*^*^*^ of Ti when the length 
of the stream is at least $\tau t/(\min_l \alpha_l)$ where $\tau$ is the maximum cover time of any
chain.

We make the following key observations. 

Proposition 2. For $i, j \in V_l$, we expect to see the pattern $(ij)$ in the stream
with the frequency $\alpha_l S(i) M^{(l)}_{ij}$.

In particular, if states $i$ and $j$ belong to the same Markov chain but the transition
probability from $i$ to $j$ is 0, the pattern $(ij)$ will not occur in the stream.

Proposition 3. For states $i$ and $j$ from separate Markov chains, we expect the
frequency of the pattern $(ij)$, $T_{ij}$, to be equal to $T_i T_j$.

There is an important caveat to the last proposition. In order to accurately
measure the frequencies of patterns $(ij)$ where $i$ and $j$ occur in different Markov
chains, it is necessary to look at positions in the stream that are sufficiently spaced
to allow mixing of the component Markov chains. Consequently, we fix a priori
positions in the stream which are $\Omega(\tau Q)$ apart, where $\tau$ is the maximum cover
time and $Q$ is the upper bound on the denominator of any probability represented
as a fraction. We then sample these positions to determine the estimate on the
frequency of various patterns.

Since the values of $S$ and $T$ are only estimates, we will use the notation $\approx$
when we are comparing equalities relating such values. By the argument given in
Lemma 1, these estimation errors will not lead us to wrong deductions, provided

^ Here and elsewhere in the paper “frequency” refers to an estimated probability, i.e.,
it is a ratio of the observed number of successes to the total number of trials where
the definition of “success” is evident from the context.

that the estimates are based on a long enough stream. Using the estimates $S(\cdot)$
and the frequencies $T_{ij}$, one can make the following deduction:

— If $T_{ij} \not\approx T_i T_j$, then $i, j$ belong to the same chain.

In the case that $i,j \in V_l$ and $\alpha_l M^{(l)}_{ij} = S(j)$, or equivalently $M^{(l)}_{ij} = S_l(j)$, the
criterion above does not suffice to provide us with clear evidence that i and j
belong to the same Markov chain and not to different Markov chains. The next
proposition may be used to disambiguate such cases.

Proposition 4. Suppose $i,j \in V_l$ such that $M^{(l)}_{ij} \ne S_l(j)$. Suppose for a state
p we cannot determine if $p \in V_l$ using the test above,† then $p \in V_l$ if and
only if pattern (ipj) has the frequency $S(i)S(p)S(j)$, which translates to the test
$T_{ipj} \approx T_i T_p T_j$.

Proof. If $p \in V_l$, then $\alpha_l M^{(l)}_{ip} = S(p)$ by the assumption $T_{ip} \approx S(i)S(p)$. Similarly,
$\alpha_l M^{(l)}_{pj} = S(j)$. Therefore, the frequency of the pattern (ipj) in the stream
is expected to be $\alpha_l^2 S(i) M^{(l)}_{ip} M^{(l)}_{pj} = S(i)S(p)S(j)$. In the case $p \notin V_l$, the same
frequency is expected to be $\alpha_l S(i)S(p)M^{(l)}_{ij}$. These two expectations are separated
since $\alpha_l M^{(l)}_{ij} \ne S(j)$ by the assumption.

Next, we give the subroutine Grow_Components that constructs a partition of
[n] using the propositions above and the frequencies T. The algorithm uses the
notation C(i) to denote the component to which i belongs.



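Grow_Components (T)

  Initialize: $\forall i \in [n]$, $C(i) \leftarrow \{i\}$

  Phase 1:
    For all $i,j \in [n]$
      If $\hat{T}_{ij} \not\approx \hat{T}_i \hat{T}_j$ then
        Union(C(i), C(j))

  Phase 2:
    For all $i,j,p \in [n]$ such that $\hat{T}_{ij} \not\approx \hat{T}_i \hat{T}_j$ and $\hat{T}_{ipj} \approx \hat{T}_i \hat{T}_p \hat{T}_j$
      Union(C(i), C(p))

  Return: the partition defined by the C(·)'s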
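
The two phases of Grow_Components can be sketched with a union-find structure.
This is a minimal sketch under assumptions made here for illustration: states
are numbered 0..n-1, the frequency tables `t1`, `t2`, `t3` are dictionaries or
lists (missing patterns count as frequency 0), and the fixed tolerance `tol`
stands in for the paper's $\approx$ comparison, which really depends on the
stream length.

```python
import math


def grow_components(n, t1, t2, t3, tol=1e-9):
    """t1[i], t2[(i, j)], t3[(i, p, j)]: estimated pattern frequencies."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    def close(a, b):
        return math.isclose(a, b, rel_tol=0.0, abs_tol=tol)

    def pair_reveals(i, j):
        # T_ij differs from T_i * T_j: evidence that i and j share a chain.
        return not close(t2.get((i, j), 0.0), t1[i] * t1[j])

    # Phase 1: merge on pair evidence.
    for i in range(n):
        for j in range(n):
            if pair_reveals(i, j):
                union(i, j)

    # Phase 2: use a revealing pair (i, j) to test a third state p.
    for i in range(n):
        for j in range(n):
            if not pair_reveals(i, j):
                continue
            for p in range(n):
                if close(t3.get((i, p, j), 0.0), t1[i] * t1[p] * t1[j]):
                    union(i, p)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With exact frequencies for a two-state chain on {0, 1} mixed evenly with a
single-state chain on {2}, the sketch returns the components {0, 1} and {2}.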



Lemma 2 (Soundness). At the end of Grow_Components, if C(i) = C(j) for
some i,j, then there exists l such that $i,j \in V_l$.

Proof. At the start of the subroutine, every state is initialized to be a component
by itself. In Phase 1, two components are merged when there is definite evidence
that the components belong to the same Markov chain by Proposition 2 or
Proposition 3. In Phase 2, $\hat{T}_{ij} \not\approx \hat{T}_i \hat{T}_j$ implies that i and j are in the same
component and hence Proposition 4 applies and shows the correctness of the
union operation performed.

† i.e., $T_{ip} \approx S(i)S(p) \approx T_{pi}$ and $T_{jp} \approx S(j)S(p) \approx T_{pj}$.

Lemma 3 (Completeness). At the end of Grow_Components, C(i) = C(j) for
all i,j such that $i,j \in V_l$ for some l and $M^{(l)}_{i'j'} \ne S_l(j')$ for some $i',j' \in V_l$.

Proof. First notice that our algorithm will identify i' and j' as being in the same
component in phase 1. Now if either $M^{(l)}_{i'i} \ne S_l(i)$ or $M^{(l)}_{ij'} \ne S_l(j')$ we would
have identified i as belonging to the same component as i' and j' in phase 1.
Otherwise, phase 2 allows us to make this determination. The same argument
holds for j as well. Thus, i and j will be known to belong to the same component as
i' and hence to each other's component.



Infer_Disjoint_MC_Mixtures (X)

  Compute $\hat{T}_i$, $\hat{T}_{ij}$ and $\hat{T}_{ipj}$
  Let $V_1, \ldots, V_m$ be the partition Grow_Components (T) returns
  For each $1 \le l \le m$
    Considering the substream of X formed by all $i \in V_l$, calculate
    estimates for the transition probabilities involving $i,j \in V_l$.



At this point, we can claim that our algorithm identifies the irreducible Markov
chains in the mixture (and their parameters). For other chains which have
not been merged, from the contrapositive of the statement of Lemma 3 it must
be the case that for all $i',j' \in V_l$ we have $M^{(l)}_{i'j'} = S_l(j')$, and the chains reduce
to one-step mixing processes.

Theorem 4. The model output by the algorithm is observationally equivalent to 
the true model with very high probability. 

3.2 Chain-Dependent Mixture Model 

We now consider the model where the mixing process chooses the next chain 
with probabilities that are dependent on the chain that last made a transition. 
As in our algorithm for the simple mixture model, we will start with each state 
in a set by itself, and keep growing components by merging state sets as long as 
we can. 

Definition 3. A triple (i,j,s) satisfying $T_{ij}T_{js} \ne T_{ijs}T_j$ is termed a revealing
triple; otherwise the triple is called non-revealing.
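
The test in Definition 3 can be written as a small predicate. This is a sketch
under assumptions introduced here: `t1`, `t2`, `t3` are hypothetical frequency
tables for single states, pairs and triples, and the tolerance `tol` replaces
the estimate-aware comparison used in the paper.

```python
import math


def is_revealing(t1, t2, t3, i, j, s, tol=1e-9):
    """Return True when T_ij * T_js differs from T_ijs * T_j."""
    lhs = t2.get((i, j), 0.0) * t2.get((j, s), 0.0)
    rhs = t3.get((i, j, s), 0.0) * t1[j]
    return not math.isclose(lhs, rhs, rel_tol=0.0, abs_tol=tol)
```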

The following lemma ensues from case analysis. 

Lemma 4. If (i,j,s) is a revealing triple then i and s belong to the same chain
and j belongs to a different chain.

The algorithm, in the first part, will keep combining the components of the first
and last states in revealing triples, until no further merging is possible. Since the
above test is sound, we will have a partition at the end which is possibly finer than
the actual partition. That is, the state set of each of the original chains is the union
of some of the parts in our partition. We can show the following:

Lemma 5. If $i,s \in M^{(l)}$, $j \in M^{(k)}$, $k \ne l$, $M^{(l)}_{is} \ne S_l(s)$ and $\alpha_{ij} \cdot \alpha_{js} \ne 0$ then
(i,j,s) is a revealing triple.

Proof. Given i,j,s as in the statement, consider the left hand side of the
inequality in Definition 3: $T_{ij}T_{js} \approx T_i\alpha_{ij}S_k(j)\,T_j\alpha_{js}S_l(s)$, and the right hand side,
$T_{ijs}T_j \approx T_i\alpha_{ij}S_k(j)\alpha_{js}M^{(l)}_{is}T_j$. Evidently, these two expressions are not equal
whenever $M^{(l)}_{is} \ne S_l(s)$.

The contrapositive of the above lemma shows that if the triple (i,j,s) is a
non-revealing triple where i and s belong to the same chain and $T_{ij}T_{js} \ne 0$ then it
must be the case that j belongs to the same chain as i and s. This suggests the
following merging algorithm:

Grow_Components_2 (T)

  Initialize: $\forall i \in [n]$, $C(i) \leftarrow \{i\}$

  Phase 1:
    For all $i,j,s \in [n]$
      If $\hat{T}_{ij}\hat{T}_{js} \not\approx \hat{T}_{ijs}\hat{T}_j$ then
        Union(C(i), C(s))

  Phase 2:
    For all $i,j,s \in [n]$ such that $i,s \in C(i) \ne C(j)$
      If $\hat{T}_{ij}\hat{T}_{js} \approx \hat{T}_{ijs}\hat{T}_j \ne 0$ then
        Union(C(i), C(j))

  Return: the partition defined by the C(·)'s

Thus if the condition $\alpha_{ij}\alpha_{ji} \ne 0$ is satisfied and the Markov chain of i is not
united in a single component, it must be the case that the Markov chain in
question is observationally reducible to one-step mixing. Thus the only remaining
cases to consider are (irreducible) Markov chains (containing i) such that for any
other chain (containing j) it must be that $\alpha_{ij}\alpha_{ji} = 0$.

To handle Markov chains $M^{(l)}$ such that for all $l' \ne l$ and $j \in M^{(l')}$ we have
$\alpha_{ij}\alpha_{ji} = 0$, the algorithm, in the second part, will perform the following steps:

1. Let $F_i(j) = T_{ij}/T_i$, i.e., the relative frequency that the next label after an i
   is j.

2. For all pairs i,j such that $T_{ij} \ne 0$, and i and j are still singleton components,
   start with $D_{ij} = \{i,j\}$.

   a) If for some state p, $F_i(p) \ne F_j(p)$, then include p in $D_{ij}$.

   b) If for some state q, $F_q(i) \ne F_q(j)$, then include q in $D_{ij}$.

3. Keep applying the rules above using all pairs in a component so far
   until $D_{ij}$ does not change any more.

4. For each starting pair i,j, a set $D_{ij}$ of states will be obtained at the end of
   this phase. Let $\mathcal{D}$ be the collection of those $D_{ij}$'s that are minimal.

5. Merge the components corresponding to the elements belonging to $D_{ij} \in \mathcal{D}$.

Lemma 6. For states i and j from separate Markov chains, $D_{ij} \notin \mathcal{D}$.

Proof. For any state s in the same chain as i, $F_j(s) = 0$, because $\alpha_{js} = 0$.
Therefore, the second closure rule will eventually include all the states from
$M^{(l)}$ in $D_{ij}$. On the other hand, for states i,v such that $v \in M^{(l)}$, $D_{iv}$ will contain
states only from $M^{(l)}$. Hence, as $D_{iv} \subset D_{ij}$, $D_{ij}$ will not be minimal.

Now we know that each set in $\mathcal{D}$ is a subset of the state space of a Markov chain.
Thus, we get

Theorem 5. Let $M^{(1)}, \ldots, M^{(k)}$ be an interleaved process with chain-dependent
mixing and no one-step-mixing Markov chains. If for all $l \in [k]$,
$\alpha_{ii} \ne 0$ for $i \in M^{(l)}$, then we can infer a model observationally equivalent to the
true model.

3.3 A Negative Result 

Suppose H is a two-state probabilistic automaton where the transition
probabilities are $H_{ija}$ where $i,j \in \{1,2\}$. Let {a} = L be the collection of all possible
labels output.

Consider the following mixture process: We will create two Markov chains
$M^{(1)}_a, M^{(2)}_a$ for each label $a \in L$. Each of these is a
Markov chain with a single state corresponding to the label a. The transition
probability from chain $M^{(i)}_a$ to $M^{(j)}_b$ is $H_{ijb}$.

Clearly the “states” of the Markov chains overlap, and it is easy
to see that the probability of observing a sequence of labels as the output of H
is the same as observing the sequence in the interleaved mixture of the Markov
chains. Since the estimation of H is intractable [1], even for two states (but
variable size alphabet), we can conclude:

Theorem 6. Identifying interleaving Markov chains with overlapping state
spaces under the chain-dependent mixture model is computationally intractable.

4 Non-disjoint State Spaces 

In the previous section we showed that in the chain-dependent mixture model
we have a reasonably sharp characterization. A natural question that arises from
the negative result is: can we characterize the conditions under which we can infer
the mixture of non-disjoint Markov chains, even for two chains? A first step
towards the goal would be to understand the simple mixture model.

Consider the most extreme case of overlap where we have a mixture of
identical Markov chains. The frequency of states in the sequence gives an
estimate of the stationary distribution S of each chain which is also the overall
stationary distribution. Note that $M^{(l)}_{ij} = M_{ij}$ for all i, j.

Consider the pattern (ij). This pattern can arise because there was a
transition from i to j in some chain or it can arise because we first observed i and
control shifted to the other chain and we observed j. Let $\alpha_c$ be the probability
that the mixing process chooses $M^{(c)}$. Then,

$$T_{ij} \approx \sum_{c=1}^{k} \alpha_c S(i)\left(\alpha_c M_{ij} + (1 - \alpha_c)S(j)\right).$$

Letting $w = \sum_c \alpha_c^2$, we can simplify the above equation to get $T_{ij} =
S(i)[wM_{ij} + (1 - w)S(j)] = S(i)[w(M_{ij} - S(j)) + S(j)]$. Rearranging terms we
have $M_{ij} = \frac{T_{ij}/S(i) - S(j)}{w} + S(j)$. Any value of w that results in $0 \le M_{ij} \le 1$ for all i, j
leads to an observationally equivalent process to the one actually generating the
stream. The set of possible w's is not empty since, in particular, w = 1 leads to
$M_{ij} = T_{ij}/S(i)$, corresponding to having just one Markov chain with these transition
probabilities.
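
The rearranged formula $M_{ij} = (T_{ij}/S(i) - S(j))/w + S(j)$ can be checked
numerically. The sketch below is illustrative only; the function names
`candidate_transition_matrix` and `is_valid` and the list-of-lists layout are
assumptions introduced here.

```python
def candidate_transition_matrix(t_pair, s, w):
    """Recover M from pair frequencies t_pair[i][j], stationary s, and w."""
    n = len(s)
    return [[(t_pair[i][j] / s[i] - s[j]) / w + s[j] for j in range(n)]
            for i in range(n)]


def is_valid(m, tol=1e-9):
    """Check that every entry of the candidate matrix lies in [0, 1]."""
    return all(-tol <= x <= 1.0 + tol for row in m for x in row)
```

For example, two identical chains with $M = [[0.9, 0.1], [0.1, 0.9]]$ mixed with
probabilities (0.5, 0.5) give $w = 0.5$ and $T_{00} = 0.35$, $T_{01} = 0.15$. Using
$w = 0.5$ recovers M, while $w = 1$ yields the equally valid single-chain matrix
with entries 0.7 and 0.3; a value such as $w = 0.3$ produces a negative entry
and is ruled out.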

What we see above is that the symmetries in the problem introduced by 
assuming that all Markov chains are identical facilitate the inference of an ob- 
servationally equivalent process. The general situation is more complicated even 
for two Markov chains. 

We consider the mixtures of two Markov chains with non-disjoint state
spaces. We give an algorithm for this case under a technical condition that
requires a special state. Namely, we require that there is a state $i_s$ that is
exclusively in one of the Markov chains, say $M^{(1)}$, and

either $M^{(1)}_{i_s j} > S_1(j)$ or $M^{(1)}_{i_s j} = 0$ for all $j \in V_1$.

Let $\alpha_1, \alpha_2$ be the mixture probabilities. Then, considering the four possible ways
of (ij) occurring in the stream, we get

$$T_{ij} = \alpha_1^2 S_1(i)M^{(1)}_{ij} + \alpha_1\alpha_2\left(S_1(i)S_2(j) + S_2(i)S_1(j)\right) + \alpha_2^2 S_2(i)M^{(2)}_{ij}.$$

Let $A_{ij} = T_{ij} - (SS^T)_{ij}$ where $S = \alpha_1 S_1 + \alpha_2 S_2$ as before. Then, we can write

$$A_{ij} = \alpha_1^2 S_1(i)\left(M^{(1)}_{ij} - S_1(j)\right) + \alpha_2^2 S_2(i)\left(M^{(2)}_{ij} - S_2(j)\right).$$

Consider the state $i_s$ required by the technical condition. For any state j
such that $M^{(1)}_{i_s j} > 0$, we have $A_{i_s j} = \alpha_1^2 S_1(i_s)\left(M^{(1)}_{i_s j} - S_1(j)\right) > 0$. For any other
state j with $S_1(j) > 0$, $A_{i_s j} = -\alpha_1^2 S_1(i_s)S_1(j) < 0$. Finally, $A_{i_s j} = 0$ for all the
remaining states.

Since $S(i_s) = \alpha_1 S_1(i_s)$, for each $j \in [n]$, we can infer $\alpha_1 S_1(j)$ from the
observations above. Hence, we can infer $\alpha_2 S_2(j)$ for each j by $S(j) = \alpha_1 S_1(j) +
\alpha_2 S_2(j)$. Since we know the vectors $S_1, S_2$, we can now calculate $M_{ij} = \alpha_1 M^{(1)}_{ij} +
\alpha_2 M^{(2)}_{ij}$ for all i,j pairs.

If state i or j exclusively belongs to one of the Markov chains, $M_{ij}$ gives the
product of the appropriate mixing parameter and the transition probability. In
the case when both states i and j are common between the Markov chains, we
will use the frequency $T_{i i_s j}$ of pattern $(i i_s j)$ to infer $\alpha_1 M^{(1)}_{ij}$ and $\alpha_2 M^{(2)}_{ij}$.
The frequency of the pattern $(i i_s j)$ is expected to be

Note that all but the last term is already inferred by the algorithm. Therefore,
$\alpha_1^2 M^{(1)}_{ij}$, hence $\alpha_1 M^{(1)}_{ij}$, can be calculated.

Finally, using the next state distribution for the state $i_s$, we can calculate $\alpha_1$
and $\alpha_2$. This completes the description of our algorithm.

5 Conclusions and Open Problems 

In this paper we have taken the first steps towards understanding the behavior of 
a mixture of Markov chains. We believe that there are many more problems to be 
explored in this area which are both mathematically challenging and practically 
interesting. 

A natural open question is the condition $\alpha_{ii} \ne 0$, i.e., there is a non-zero
probability of observing the next label from the same Markov chain. We note
that Freund and Ron had made a similar assumption that $\alpha_{ii}$ is large, which
allowed them to obtain “pure” runs from each of the chains. It is conceivable
that the inference problem of disjoint state Markov chains becomes intractable
after we allow $\alpha_{ii} = 0$.

Another interesting question is optimizing the length of the observation
required for inference, or, if sufficient lengths are not available, computing the
best partial inference possible. This is interesting even for a small number of
states (about 50), and a possible solution may be to trade off computation or
storage against observation length.

Abstract. The Probably Exact model (PExact) is a relaxation of the
Exact model, introduced by Bshouty. In this paper, we show that the
PExact model is equivalent to the Exact model.

We also show that in the Exact model, the adversary (oracle) gains no
additional power from knowing the learner's coin tosses a priori.



1 Introduction 

In this paper we examine the Probably Exact (PExact) model introduced by
Bshouty in [5] (called PEC there). This model lies between Valiant's PAC model
[12] and Angluin's Exact model [1].

We show that the PExact model is equivalent to the Exact model, thus
extending the results by Bshouty et al. [8] who showed the PExact model is
stronger than the PAC model (under the assumption that one-way functions
exist), as well as that the deterministic Exact model (where the learning algorithm
is deterministic) is equivalent to the deterministic PExact model.

The PExact model is a variant of the Exact model, in which each coun- 
terexample to an equivalence query is drawn according to a distribution, rather 
than maliciously chosen. The main advantage of the PExact model is that the 
teacher is not an adversary. For achieving lower bounds in the Exact model, (like 
those given by Bshouty in [5]), we must consider a malicious adversary with un- 
bounded computational power that actively adapts its behavior. On the other 
hand, in the PExact model the only role of the adversary is to choose a target 
and a distribution. After that the learning algorithm starts learning without any 
additional adversarial influence. 

For removing randomness from the PExact model, we introduce a new variation
of the model introduced by Ben-David et al. in [3]. We call this the Ordered
Exact (OExact) model. This model is similar to the PExact model, where
instead of a distribution function we have an ordered set. Each time the OExact
oracle gets an equivalence query, it returns the lowest indexed counterexample,
instead of randomly or maliciously choosing one.

Another model we consider in this work is the random-PExact model,
introduced by Bshouty and Gavinsky [7]. The random-PExact model is a relaxation
of the PExact model that allows the learner to use random hypotheses. We will
show that for every algorithm A that uses some restricted random hypothesis
for efficiently learning the concept class C in the random-PExact model, there
exists an algorithm ALG that efficiently learns C in the Exact model.

In addition, we show that the adversary does not gain any additional power
by knowing all coin tosses in advance. In other words, we show that offline-Exact
learning = Exact learning.

In [8] Bshouty et al. showed that Exact-learnable $\Rightarrow$ PExact-learnable $\Rightarrow$
PAC-learnable. Based on Blum's construction [4] they also showed that under
the standard cryptographic assumptions (that one-way functions exist),
PExact-learnable $\ne$ PAC-learnable. In [7], Bshouty and Gavinsky showed that under
polybit distributions, PExact-learnable = PAC-learnable. In this work we will
exploit the exponential probabilities to show that PExact-learnable $\Rightarrow$
Exact-learnable.

Another model residing between the PAC model and the PExact model is the
PAExact model introduced by Bshouty et al. in [8]. The PAExact model is similar
to the PExact model, but allows the learner some exponentially small final error
(as opposed to the exact target identification required in PExact). Bshouty and
Gavinsky [7] showed that PAExact-learnable = PAC-learnable using boosting
algorithms based on [11] and [10]. In [6], Bshouty improves the error factor and
gives a simpler algorithm for the boosting process.

The following chart indicates relations between the models. 

    Exact          PAExact
      ‖      ⇒        ‖
    PExact           PAC

We note that this work represents results independently obtained by the 
authors. This joint publication has evolved from a manuscript by Avi Owshanko; 
the other author’s original manuscript [9] may be found at his web page. 

2 Preliminaries 

In the following we formally define the models we use. We will focus on exact
learning of concept classes. In this setting, there exists some learning algorithm
A with the goal of exactly identifying some target concept t out of the concept
class C over a domain X. In this paper we consider only finite and countably
infinite domains X. The learner A has full knowledge of the domain X and of
the concept class C, but does not have any a priori knowledge about the target
concept t. As each concept $t \in C$ is a subset of the domain X, we will refer to it as
a function $t : X \to \{0, 1\}$.

For learning the target concept, the learner can ask some teacher (also
referred to as an oracle) several kinds of queries about the target. The teacher
can be regarded as an adversary with unlimited computational power and full
knowledge of all that the learner knows. The adversary must always answer
queries honestly, though it may choose the worst (correct) answer. If the
adversary knows in advance all the learner's coin tosses, we call the adversary an
offline adversary and call the model an offline model.

In this paper we will focus on efficient learning under several models.
Whenever we write efficient learning of some target t with success probability $\delta$, we
mean that the learning algorithm receives the answer “Equivalent” after time
polynomial in $size_C(t)$, $\log(1/\delta)$ and b (the size of the longest answer that the
teacher returns).

We now give the formal definitions of Exact learning [1], PExact learning
[5] and a new model we denote OExact (which is a variation of the model
considered in [3]).

We say that a concept class C is learnable in some model if there exists
some algorithm A such that for every target $t \in C$, and each confidence level
$\delta$, A efficiently learns t with the help of the teacher, with success probability
greater than $1 - \delta$. We say that a learner is random if it uses coin tosses and
deterministic otherwise.

In the Exact model, the learner A supplies the adversary with some
hypothesis h (such that h can be computed efficiently for every point x in X) and the
adversary either says “Equivalent”, or returns a counterexample, $x \in X$ such
that $t(x) \ne h(x)$.

In the PExact (probably exact) model, the PExact teacher holds some
probability distribution D over X, as well as the target $t \in C$. Both the target
and the distribution functions are determined before the learning process starts
and stay fixed for the duration of the learning process. The learner can supply
the teacher with some hypothesis h and the teacher either returns “Equivalent”
(when $\Pr_D[h(x) \ne t(x)] = 0$), or returns some counterexample x. The
counterexample is randomly chosen, under the distribution D induced over all erroneous
points $x \in X$ (that is, $h(x) \ne t(x)$).

In the OExact (ordered exact) model, the OExact oracle holds some finite
well ordered set $S \subseteq X$. For each query of the algorithm A, the OExact oracle
returns $x \in S$ where x is the smallest member of S such that $h(x) \ne t(x)$.
For every member $x \in S$, we let Ord(S, x) denote the number of elements
in S that are smaller than x (for example, for $x_0$ the smallest member of S,
$Ord(S, x_0) = 0$).
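
The OExact oracle just defined can be sketched in a few lines. This is a
hypothetical illustration: the names `oexact_oracle` and `ord_index`, the use of
a Python list for the ordered set S, and returning `None` to stand for the
answer “Equivalent” when no point of S is misclassified are all assumptions
made here.

```python
def oexact_oracle(ordered_s, target, hypothesis):
    """Return the smallest x in S with hypothesis(x) != target(x).

    `ordered_s` lists the members of S from smallest to largest.
    Returns None when the hypothesis agrees with the target on all of S.
    """
    for x in ordered_s:
        if hypothesis(x) != target(x):
            return x
    return None


def ord_index(ordered_s, x):
    """Ord(S, x): the number of elements of S that are smaller than x."""
    return ordered_s.index(x)
```

Unlike the PExact teacher, this oracle is deterministic: the same hypothesis
always receives the same counterexample.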

There exists a relaxed variation of the PExact
model, denoted random-PExact, introduced by Bshouty and Gavinsky [7]. In
this setting, the algorithm A may use a random hypothesis. A random hypothesis
$h_r : X \times R \to \{0, 1\}$ is a function such that for every input $x_0 \in X$ it randomly
and uniformly chooses $r_0 \in R$ and returns $h_{r_0}(x_0)$. As before, the teacher may either
answer “Equivalent” (when $\forall x \in X : \Pr[h_r(x) \ne t(x)] = 0$) or return some
counterexample x. For choosing the counterexample, the teacher keeps randomly
choosing points x in X according to the distribution D until the first point such




PExact = Exact Learning 203 



that $h_r(x) \ne t(x)$. For the Exact (OExact) model, the adversary returns some
(the smallest) point $x \in X$ ($x \in S$) such that $\Pr[h_r(x) \ne t(x)] > 0$.

We will also use the following inequality:

Theorem 1 (Chernoff inequality). Let $Y_1, Y_2, \ldots, Y_n$ be independent random
variables such that for $1 \le i \le n$, $\Pr[Y_i = 1] = p_i$, where $0 < p_i < 1$. Then,
for $Y = \sum_{i=1}^{n} Y_i$, $\mu = E[Y] = \sum_{i=1}^{n} p_i$ and $0 < \lambda \le 1$,

$$\Pr[Y < (1 - \lambda)\mu] < e^{-\mu\lambda^2/2}.$$

3 The Learning Algorithm 

In this section we introduce a scheme relying on majority vote to turn every 
algorithm A that efficiently learns a concept class C in the PExact model into 
an algorithm ALG that can learn C in the Exact model. 

We will rely on the fact that you can fool most of the people some of the
time, or some of the people most of the time, but you can never fool most of
the people most of the time. Consider some algorithm A where for every target
$t \in C$, there exists some bound T, such that A makes no more than T mistakes
with some probability p. When we run two copies of A, the probability that both
make mistakes on the same T points (in the same order) is $p^2$. When running k
copies of A, the probability that all make mistakes on the same points is $p^k$. But
this fact is not enough for building a new algorithm, because it is not enough
for us to know that there is a possible error; we need to label every point
correctly. Hence we need the number of points that more than
half the running copies of A mislabel to be bounded by some factor of T. We will
prove that if A is an efficient PExact algorithm, then there exists some such
(efficient) bound T for every target $t \in C$, and that the number of errors is no
more than 4T.

Because the learner does not know the target t in advance, it must find this
bound T dynamically, using a standard doubling technique: each iteration
doubles the allowable number of mistakes (and the number of copies of A) until
successfully learning t. The full algorithm can be viewed in Figure 1.
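
The majority-vote step that the scheme relies on can be sketched as follows.
This is an illustration only: the name `majority`, the representation of
hypotheses as Python callables returning 0 or 1, and the choice to break ties
in favor of 0 are assumptions made here, not details fixed by the paper.

```python
def majority(hypotheses):
    """Combine 0/1-valued hypotheses into a single majority-vote hypothesis."""
    def h(x):
        votes = sum(1 for g in hypotheses if g(x))
        # Label 1 only when strictly more than half of the copies say 1.
        return 1 if 2 * votes > len(hypotheses) else 0
    return h
```

A point is mislabeled by the combined hypothesis only if more than half of the
running copies mislabel it, which is the quantity the analysis bounds.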

We start by showing that A is an efficient learning algorithm in the OExact 
model. That way, we can remove the element of randomness that is inherent to 
the PExact model. 

Lemma 2. If A learns every target t in C using less than T(t) steps, with the
aid of a PExact teacher with confidence greater than 0.95, then there exists an
algorithm A' (a copy of A) that learns every target t in C using less than T(t)
steps, with the aid of an OExact teacher with confidence greater than 0.9.

Proof: In this proof we build for every well ordered set S and every target
$t \in C$ a step probability function $D_S$ that will force the PExact oracle to behave
the same as the OExact oracle (with high probability).

We will run both algorithms A and A' in parallel, where both use the same
random strings (when they are random algorithms). Let k be the size of S and
let l denote T(t). We define the probability distribution $D_S$ as follows (recall
that Ord(S,x) denotes the number of elements in S that are smaller than x).

$$D_S(x) = \begin{cases} 0 & x \notin S \\ \dfrac{(40l+2)^{-Ord(S,x)}}{\sum_{i=0}^{k-1}(40l+2)^{-i}} & x \in S \end{cases}$$

ALG

 1. Init $k \leftarrow 4$
 2. Do
 3.   Init $V \leftarrow \emptyset$, $count \leftarrow 0$
 4.   Let $\mathcal{A} \leftarrow \{A_1, A_2, \ldots, A_k\}$ [where each $A_i$ is a copy of A]
 5.   Init each copy $A_i$ in $\mathcal{A}$.
 6.   While (count < k)
 7.     Run each copy $A_i \in \mathcal{A}$ until it asks an equivalence query,
        terminates, or has executed more than k/4 steps.
 8.     Remove from $\mathcal{A}$ all copies $A_i$ that either terminated unsuccessfully
        or executed more than k/4 steps.
 9.     If there exists some copy $A_i \in \mathcal{A}$ asking an equivalence query with
        a hypothesis $h_i$ that is not consistent with V
10.       Let (y, label(y), c) be a counterexample with the lowest index c.
11.       Give (y, label(y)) as a counterexample to $A_i$
12.       Jump back to step 7
13.     End If
14.     Let $h = majority\{h_1, h_2, \ldots, h_k\}$
        [where $h_i$ is $A_i$'s hypothesis at this point].
15.     Let $x \leftarrow EQ(h)$. If x = “Equivalent”, return h as the answer
        else, add (x, label(x), count) to V
16.     Let $count \leftarrow count + 1$
17.   End While
18.   Let $k \leftarrow 2k$
19. While the hypothesis h is not equivalent to the target.

Fig. 1. The learning algorithm

Consider the case that both A and A' ask their teachers some equivalence query
using the same hypothesis h. Let x be the counterexample that the OExact
teacher returns to A'. By definition of the OExact model, x is the smallest
counterexample in S. The probability that the PExact teacher returns to A a
counterexample y such that $y \ne x$ (and Ord(S,y) > Ord(S,x)) is less than

$$\sum_{j=Ord(S,x)+1}^{k-1} \frac{(40l+2)^{-j}}{\sum_{i=0}^{k-1}(40l+2)^{-i}} \le \frac{2\,(40l+2)^{-Ord(S,x)}}{(40l+2)\cdot\sum_{i=0}^{k-1}(40l+2)^{-i}} = \frac{D_S(x)}{20l+1}.$$

Hence, the PExact oracle returns the lowest indexed counterexample x with
probability greater than $1 - \frac{1}{20l+1}$.

We can conclude that the PExact and the OExact teachers return the same
answer with probability greater than $1 - \frac{1}{20l}$; the probability for l such
consecutive answers is greater than $\left(1 - \frac{1}{20l}\right)^l \ge 0.95$.

Because both A and A' hold the same random string, they will both behave the
same (ask the same queries) until the first time that the teachers give different
answers. On the other hand, A learns t using less than T(t) steps with confidence
of 0.95. So we can conclude that with confidence greater than 0.95 · 0.95 > 0.9,
A' learns t in the OExact model using less than T(t) steps. ■
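
The geometric decay used in this proof can be checked with exact rational
arithmetic. The sketch below assumes, for illustration, that $D_S$ puts weight
proportional to $(40l+2)^{-Ord(S,x)}$ on the k members of S; the function names
`step_distribution` and `tail_ratio` are hypothetical.

```python
from fractions import Fraction


def step_distribution(k, l):
    """Weights proportional to (40l+2)^(-o) for o = 0..k-1, normalized."""
    base = Fraction(1, 40 * l + 2)
    weights = [base ** o for o in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def tail_ratio(dist, o):
    """Total mass of elements after position o, relative to the mass at o."""
    return sum(dist[o + 1:]) / dist[o]
```

For k = 5 and l = 3, every tail ratio stays below 1/(20l+1) = 1/61, and
$(1 - 1/60)^3 \ge 0.95$, which is the kind of bound the proof needs over l
consecutive queries.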

Our next step is to show that if A is an efficient OExact learning algorithm, 
then ALG learns C in the Exact model. 

Lemma 3. Let X be a finite domain. If A learns every target t in C using less
than T(t) steps, with the aid of an OExact teacher with confidence level greater
than 0.9, then ALG learns every t in C with the aid of an offline-Exact teacher,
with probability greater than $1 - \delta$ using less than

$$O\left((\log(1/\delta) + T(t)\log(|X|) + \log(|C|))^2\right)$$

steps.

Proof: Let l denote T(t) and let $m \ge m_0 = 20(\ln(1/\delta) + 4l\ln(|X|) + \ln(|C|))$.
Consider running 3m copies of the learning algorithm A, over some given ordered
set S of size 4l. We shall calculate the probability that m of these copies need
more than l steps to exactly learn t.

Using the Chernoff inequality (Theorem 1), we have n = 3m, $\mu = 0.9 \cdot 3m = 2.7m$, and
$\lambda > 0.2$:

$$\Pr[Y < 2m] < e^{-0.054m} \le \frac{\delta}{|X|^{4l}\cdot|C|} \qquad (1)$$
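
The arithmetic behind this step can be checked numerically. The sketch below
assumes the standard multiplicative lower-tail Chernoff bound
$e^{-\mu\lambda^2/2}$ and the value $m_0 = 20(\ln(1/\delta) + 4l\ln|X| + \ln|C|)$;
the function names and the sample parameter values are hypothetical.

```python
import math


def chernoff_lower_tail(mu, lam):
    """Bound on Pr[Y < (1 - lam) * mu], assuming the form exp(-mu*lam^2/2)."""
    return math.exp(-mu * lam * lam / 2.0)


def m0(delta, l, size_x, size_c):
    """Number-of-copies threshold used in the proof of Lemma 3."""
    return 20.0 * (math.log(1.0 / delta)
                   + 4 * l * math.log(size_x)
                   + math.log(size_c))
```

With $\mu = 2.7m$ and $\lambda = 0.2$ the exponent is $0.054m$, so taking
$m = m_0$ gives an exponent of $1.08(\ln(1/\delta) + 4l\ln|X| + \ln|C|)$, which is
enough to push the failure probability below $\delta/(|X|^{4l}|C|)$.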

Next we define the following property: 

Property I: The probability that there exists some target $t \in C$ and some ordered
set S of size 4l such that more than m copies of A will need more than l steps
to learn t is less than $\delta$.

The reasoning behind this claim is as follows. Assume that all 3m copies of
A have a sequence of random bits. We let the adversary know these random bits
and look for some target $t \in C$ and some ordered set S that will cause more
than m copies to fail. The number of possible target concepts $t \in C$ is |C| and
the number of possible ordered sets is less than $|X|^{4l}$. On the other hand, the
probability for some set to cause more than m copies to fail for some target t is
less than $\frac{\delta}{|X|^{4l}\cdot|C|}$ by (1). Hence the probability for the existence of such a bad
target t and ordered set S is less than

$$\frac{\delta}{|X|^{4l}\cdot|C|} \cdot |X|^{4l} \cdot |C| = \delta$$

and Property I holds.

We now consider ALG's main loop (steps 6-17 in Figure 1) when $6m_0 \ge k >
3m_0$ (ALG reaches this loop after $O(k^2)$ steps, unless it already received
the answer “Equivalent”). Assume that ALG receives 4l counterexamples in
this loop (recall that 4l < k). Note that this set of counterexamples defines an
ordered set S of size 4l (we order the counterexamples chronologically). Because
each such counterexample is given to at least half the currently running copies
of A, at least m copies of A received at least l counterexamples (or executed
more than k/4 > l steps). But Property I states that there exists such a set of
counterexamples with probability smaller than $\delta$.

So we conclude that with probability greater than $1 - \delta$, ALG learns t in the
Exact model when $6m_0 \ge k$, where the number of steps is bounded by

$$O(ml) = O\left((\log(1/\delta) + T(t)\log(|X|) + \log(|C|))^2\right).$$



Our next step is to remove the size of the domain X and the concept class
C from the complexity analysis.

Lemma 4. If A learns every target t in C using less than T(t) steps, with the
aid of an OExact teacher with confidence level greater than 0.9, then ALG learns
every t in C with the aid of an offline-Exact teacher, with probability greater than
$1 - \delta$ using less than

$$O\left((\log(1/\delta) + T(t)(size(t) + b))^2\right)$$

steps, where b is the size of the longest counterexample that the teacher returns.

Proof: For some set Q, we let $Q^b$ denote all members of Q that are represented
by no more than b bits. By definition, $|Q^b| \le 2^{b+1}$. By Lemma 3, there exists
some constant c, such that for every finite domain X, ALG learns every t in C
with the aid of an offline-Exact teacher with probability greater than $1 - \delta$ using
less than $c \cdot (\log(1/\delta) + T(t)\log(|X|) + \log(|C|))^2$ steps.

Let us consider the case that the longest counterexample b, or the size of the
target t ($size_C(t)$), is at least $2^i$ and less than $2^{i+1}$. We let d denote $2^i$. So we
have that $d \le size(t) + b$. Applying Lemma 3, we get that ALG learns t with
probability greater than $1 - \delta/d$, using less than

$$c \cdot (\log(1/\delta) + T(t)\log(|X|) + \log(|C|))^2$$
$$\le c \cdot (\log(1/\delta) + T(t)\log(d) + \log(d))^2$$
$$= c \cdot (\log(1/\delta) + (T(t) + 1)(size(t) + b))^2$$

steps. Hence, the probability to find some $d = 2^i$ such that ALG will be forced
to use more than $c \cdot (poly(size(t)) \cdot \log^2(d/\delta) \cdot 16d^2)$ steps is less than:

$$\sum_{i=1}^{\infty} \frac{\delta}{2^i} = \delta \sum_{i=1}^{\infty} \frac{1}{2^i} = \delta$$

and the lemma holds. ■

At this point we can conclude that:

Theorem 5. PExact = offline-Exact learning.

Proof: This theorem immediately follows from Lemmas 2 and 4. In Lemma
2 we showed that every algorithm A that efficiently learns the class C in the
PExact model with probability greater than 0.95 also efficiently learns C in the
OExact model with probability greater than 0.9. In Lemma 4 we showed that if
A efficiently learns C in the OExact model with probability greater than 0.9, the
algorithm ALG efficiently learns C in the offline-Exact model with any needed
confidence level $1 - \delta$. On the other hand, Bshouty et al. [8] already showed that
Exact $\Rightarrow$ PExact. Hence the theorem holds. ■

An additional interesting result following immediately from Theorem 5 is:

Corollary 6. Exact = offline-Exact learning.

4 Handling the Random Model 

We now show that if A is an efficient algorithm for learning C in the random-PExact
model and if A satisfies some constraints, then ALG learns C in the
Exact model. Namely, we will show that if we can efficiently determine, for every
hypothesis $h_r$ that A produces and for every $x \in X$, whether $0 < E[h_r(x)] < 1$ or
not, then if A learns C in the random-PExact model, ALG learns C in the Exact
model. As in the previous section, we start by showing that random-PExact =
OExact.

Lemma 7. If A efficiently learns C in the random-PExact model with probability
greater than 0.95, then A efficiently learns C in the OExact model with
probability greater than 0.9.

Proof: This proof is similar to that of Lemma 2. For every target $t \in C$
and every order $S$ of elements of $X$ we build a step distribution function that will
force the random-PExact oracle to behave in the same way as the OExact oracle.

Let $k$ be the size of $S$ and assume that A needs $l = poly(size(t))$ steps.
Consider running A in the OExact model until A executes $l$ steps
(or terminates successfully). Let $h_r^i$ denote A's hypothesis after the $i$-th step.
Because the number of steps is bounded by $l$, there exists some $0 < \lambda < 1$ such
that for all members $x \in S$ and all steps $0 \le i \le l$,

$$(E[h_r^i(x)] = 0) \lor (E[h_r^i(x)] = 1) \lor (\lambda < E[h_r^i(x)] < 1 - \lambda).$$

Using this value $\lambda$, we define the probability distribution $D_S$ as follows:

0 X S 

/ , \Ord{S,x) 

(401+2) ■ ffk / ^ y X G S 

For every member $x$ of $S$, we let $Y(x) \subset S$ denote all members of $S$ larger
than $x$ in the order $S$. By definition of $D_S$, we have

Ds{x) > ^ • Sy(zY(x)Ds{y)- 
From this point on, the proof is similar to that of Lemma 2. The probability
to receive the smallest possible $x$ as the counterexample in the random-PExact
model under the probability distribution $D_S$ is at least $\frac{20l}{20l+1}$, so the
probability that the random-PExact oracle behaves the same as the OExact oracle for
all $l$ steps is greater than 0.95. So we conclude that A learns C in the OExact
model with probability greater than 0.9. ■

Having shown that random-PExact = OExact, we can apply the same
proofs as in the previous section to obtain the following result:

Theorem 8. If A efficiently learns C in the random-PExact model, and if for
every hypothesis $h_r$ that A holds and every $x \in X$ we can (efficiently) determine
whether $0 < E[h_r(x)] < 1$ or not, then ALG efficiently learns C in the Exact
model.

Proof: The proof is similar to that of Theorem 5. We can still emulate the
way that the OExact oracle behaves, because for every hypothesis $h_r$ and every
$x \in X$ we can efficiently determine whether $0 < E[h_r(x)] < 1$ or not. When
$h_r$ can assign $x$ both values, we can give $x$ as a counterexample. Otherwise, we
can choose any random string $r$ (for example, all bits zero) and calculate the
value of $h_r(x)$. Also note that if $x$ is a counterexample for ALG, then at least
half of the running copies of A can receive $x$ as a counterexample. So we can use
both Lemmas 2 and 4. The rest of the proof is similar. ■

5 Conclusions and Open Problems 

In this paper we showed that PExact = Exact learning, thus allowing the use of
a model without an adaptive adversary in order to prove computational lower
bounds. We also showed that a limited version of the random-PExact model is
equivalent to the Exact model. An interesting question left open is whether the
random-PExact model is strictly stronger than the Exact model or not (assuming that
P ≠ NP).

The second result we gave is that even when the adversary knows all the
learner's coin tosses in advance (the offline-Exact model), it does not gain any
additional computational power. This result also holds when the learner has the
help of a membership oracle, but it is not known whether it still holds when
the membership oracle is limited, such as in [2].
Abstract. We consider the problem of learning a general graph using edge-detecting
queries. In this model, the learner may query whether a set of vertices
induces an edge of the hidden graph. This model has been studied for
particular classes of graphs by Kucherov and Grebinski [1] and Alon et al. [2],
motivated by problems arising in genome sequencing. We give an adaptive
deterministic algorithm that learns a general graph with $n$ vertices and $m$ edges
using $O(m \log n)$ queries, which is tight up to a constant factor for classes of
non-dense graphs. Allowing randomness, we give a 5-round Las Vegas algorithm
using $O(m \log n + \sqrt{m} \log^2 n)$ queries in expectation. We give a lower bound of
$\Omega((2m/r)^{r/2})$ for learning the class of non-uniform hypergraphs of dimension
$r$ with $m$ edges. For the class of $r$-uniform hypergraphs with bounded degree $d$,
where $d <$ […], we give a non-adaptive Monte Carlo algorithm
using $O(dn \log n)$ queries, which succeeds with probability at least $1 - n^{-c}$,
where $c$ is any constant.



1 Introduction 

The problem of learning a hidden graph is the following. Imagine that there is a graph
$G = (V, E)$ whose vertices are known to us and whose edges are not. We wish to
determine all the edges of $G$ by making edge-detecting queries of the following form

$Q_G(S)$: does $S$ include at least one edge of $G$?

where $S \subseteq V$. The query $Q_G(S)$ is answered 1 or 0, indicating whether $S$ contains
both ends of at least one edge of $G$ or not. We abbreviate $Q_G(S)$ to $Q(S)$ whenever
the choice of $G$ is clear from the context. The edges and non-edges of $G$ are completely
determined by the answers to $Q(\{u, v\})$ for all unordered pairs of vertices $u$ and $v$;
however, we seek algorithms that use significantly fewer queries when $G$ is not dense.
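
To make the query model concrete, here is a minimal Python sketch of an edge-detecting oracle together with the exhaustive pairwise baseline mentioned above. The names `make_oracle` and `learn_by_pairs` are illustrative assumptions, not part of the paper, and the hidden graph is assumed to be supplied as a list of vertex pairs.

```python
from itertools import combinations


def make_oracle(edges):
    """Build Q(S) for a hidden graph; also return a mutable query counter.

    Hypothetical helper: `edges` is an iterable of 2-element vertex pairs.
    """
    edge_set = {frozenset(e) for e in edges}
    counter = {"queries": 0}

    def Q(S):
        # Answer 1 iff S contains both endpoints of at least one hidden edge.
        counter["queries"] += 1
        S = set(S)
        return int(any(e <= S for e in edge_set))

    return Q, counter


def learn_by_pairs(vertices, Q):
    """Baseline: ask Q({u, v}) for every unordered pair of vertices."""
    return {frozenset(pair) for pair in combinations(vertices, 2) if Q(pair)}
```

The baseline always spends one query per unordered pair, however sparse the graph is; this is the cost that the algorithms discussed here aim to improve on.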

This type of query may be motivated by the following scenario. We are given a set 
of chemicals, some pairs of which react and others don’t. When multiple chemicals 
are combined in one test tube, a reaction is detectable if and only if at least one pair 
of the chemicals in the tube react. The task is to identify which pairs react using as 
few experiments as possible. The time needed to compute which experiments to do is a 
secondary consideration, though it is polynomial for the algorithms we present. 

An important aspect of an algorithm in this model is its adaptiveness. An algorithm 
is non-adaptive if the whole set of queries it makes is chosen before the answers to any 
queries are known. An algorithm is adaptive if the choice of later queries may depend on
the answers to earlier queries. Although adaptiveness is powerful, non-adaptiveness is 
desirable in practice to permit the queries (or experiments) to be parallelized. A multiple- 
round algorithm consists of a sequence of rounds in which the set of queries made in 
a given round may depend only on the answers to queries asked in preceding rounds. 
Since the queries in each round may be parallelized, it is desirable to keep the number 
of rounds small. A non-adaptive algorithm is a 1-round algorithm.

Another important aspect of an algorithm is what assumptions may be made about 
the graph G; this is modeled by assuming that G is drawn from a known class of graphs. 
Previous work has mainly concentrated on identifying a graph G drawn from the class 
of graphs isomorphic to a fixed known graph. The cases of Hamiltonian cycles and 
matchings have specific applications to genome sequencing, which are explained in the 
papers cited below. Grebinski and Kucherov [1] give a deterministic adaptive algorithm
for learning Hamiltonian cycles using $O(n \log n)$ queries. Beigel et al. [3] describe an
8-round deterministic algorithm for learning matchings using $O(n \log n)$ queries, which
has direct application in genome sequencing projects. Alon et al. [2] give a 1-round Monte
Carlo algorithm for learning matchings using $O(n \log n)$ queries, which succeeds with
probability at least $1 -$ […]. On the other hand, they show a lower bound of […]
for learning matchings with a deterministic 1-round algorithm. They also give a nearly
matching upper bound in this setting. Alon and Asodi [4] give bounds for learning stars
and cliques with a deterministic 1-round algorithm. Considerable effort has been devoted
to optimizing the implied constants in these results.

In this paper, we are interested in the power of edge-detecting queries from a more 
theoretical point of view. In particular, we consider the problem of learning more gen- 
eral classes of graphs. Because of this focus, in this paper, we are more interested in 
asymptotic results than optimizing constants. 

Let $n$ denote the number of vertices and $m$ the number of edges of $G$. Clearly $n$ is
known to the algorithm (since $V$ is known), but $m$ may not be. In Section 3, we give
a deterministic adaptive algorithm to learn any graph using $O(m \log n)$ queries. The
algorithm works without assuming $m$ is known. For Hamiltonian cycles, matchings, and
stars, our algorithm uses $O(n \log n)$ queries. In Section 4, we give a 1-round Monte Carlo
algorithm for all graphs of degree at most $d$ using $O(dn \log n)$ queries that succeeds with
probability at least $1 - n^{-c}$, assuming $d$ is known. Note that Hamiltonian cycles and
matchings are both degree bounded by constants. This algorithm takes $O(n \log n)$ queries
in both cases. In Section 5, we consider constant-round algorithms for general non-dense
graphs. We first briefly describe a 4-round Las Vegas algorithm using
$O(m \log n + \sqrt{m} \log^2 n)$ queries in expectation, assuming $m$ is known. If $m$ is not
known, we give a 5-round Las Vegas algorithm that uses as many queries. Note that
$O(\sqrt{m} \log^2 n)$ is negligible when $m = \Omega(\log^2 n)$. Therefore, the 5-round algorithm
achieves $O(\log n)$ queries per edge unless the graph is very sparse, i.e. $m = o(\log^2 n)$.

In Section 6 we consider the problem of learning hypergraphs. The information-theoretic
lower bound implies that $\Omega(rm \log n)$ queries are necessary for learning the
class of hypergraphs of dimension $r$ with $m$ edges. We show further that no algorithm can
learn this class of hypergraphs using $o((2m/r)^{r/2})$ queries. However, non-uniformity
of hypergraphs does play an important role in our construction of the lower bound. Thus we
leave open the problem of the existence of an algorithm for $r$-uniform hypergraphs with $m$
edges using $O(rm \log n)$ queries. On the other hand, we show that hypergraphs of
bounded degree $d$, where […], are learnable with $O(dn \log n)$
queries using a Monte Carlo algorithm, which succeeds with probability at least $1 - n^{-c}$.

The graph learning problem may also be viewed as the problem of learning a mono- 
tone disjunctive normal form (DNF) boolean formula with terms of size 2 using member- 
ship queries only. Each vertex of G is represented by a variable and each edge by a term 
containing the two variables associated with the endpoints of the edge. A membership 
query assigns 1 or 0 to each variable, and is answered 1 if the assignment satisfies at least 
one term, and 0 otherwise, that is, if the set of vertices corresponding to the variables 
assigned 1 contains both endpoints of at least one edge of G. Similarly, a hyperedge with 
r vertices corresponds to a term with r variables. Thus, our results apply also to learning 
the corresponding classes of monotone DNF formulas using membership queries. The 
graph-theoretic formulation provides useful intuitions. 



2 Preliminaries 



A hypergraph is a pair $H = (V, E)$ such that $E$ is a subset of the power set of $V$, where
$V$ is the set of vertices and $E$ is the set of edges. A set $S$ is an independent set of $H$ if it
contains no edge of $H$. The degree of a vertex is the number of edges of $H$ that contain
it. If $S$ is a set of vertices, then the neighbors of $S$ are all those vertices $v$ not in $S$ such
that $\{u, v\}$ is contained in an edge of $H$ for some $u \in S$. We denote the set of neighbors
of $S$ by $\Gamma(S)$. The dimension of a hypergraph $H$ is the cardinality of the largest set in $E$.
$H$ is said to be $r$-uniform if $E$ contains only sets of size $r$. In an $r$-uniform hypergraph,
a set of vertices of size $r$ is called a non-edge if it is not an edge of $H$.

An undirected simple graph $G$ with no self-loops is just a 2-uniform hypergraph. Thus
the edges of $G = (V, E)$ may be considered to be a subset of the set of all unordered
pairs of vertices of $G$. A $c$-coloring of a graph $G$ is a function from $V$ to $\{1, 2, \ldots, c\}$
such that no edge of $G$ has both endpoints mapped to the same color. The set of vertices
assigned the same color by a coloring is a color class of the coloring.

We divide a set $S$ in half by partitioning it arbitrarily into two sets $S_1$ and $S_2$ such
that $|S_1| = \lfloor |S|/2 \rfloor$ and $|S_2| = \lceil |S|/2 \rceil$.

Here are two inequalities that we use. 

Proposition 1. If $0 < x < 1$, then

$$1 - x < e^{-x}.$$

Proposition 2. If $0 < x < 1$, then

$$(1 - x)^{1/x} > \frac{2(1 - x)}{e(2 - x)} > \frac{1 - x}{e}.$$
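
As a quick numerical sanity check (not a proof), the following Python sketch evaluates the two inequalities, $1 - x < e^{-x}$ and $(1-x)^{1/x} > 2(1-x)/(e(2-x)) > (1-x)/e$, at a given point of the open interval $(0, 1)$. The function names are illustrative assumptions.

```python
import math


def prop1_holds(x):
    """Check 1 - x < exp(-x) at a single point 0 < x < 1."""
    return 1 - x < math.exp(-x)


def prop2_holds(x):
    """Check (1-x)^(1/x) > 2(1-x)/(e(2-x)) > (1-x)/e at a single point."""
    lhs = (1 - x) ** (1 / x)
    mid = 2 * (1 - x) / (math.e * (2 - x))
    rhs = (1 - x) / math.e
    return lhs > mid > rhs
```

Evaluating both functions on a grid such as $x = 0.01, 0.02, \ldots, 0.99$ gives a cheap regression check when the inequalities are used with concrete constants.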
3 An Adaptive Algorithm

The main result of this section is the following. 

Theorem 1. There is a deterministic adaptive algorithm that identifies any graph $G$
drawn from the class of all graphs with $n$ vertices using $O(m \log n)$ edge-detecting
queries, where $m$ is the number of edges of $G$.

By a counting argument, this upper bound is tight up to a constant factor for certain 
classes of non-dense graphs. 

Theorem 2. $\Omega(\epsilon m \log n)$ edge-detecting queries are required to identify a graph $G$
drawn from the class of all graphs with $n$ vertices and $m = n^{2-\epsilon}$ edges.

We begin by presenting a simple adaptive algorithm for the case of finding the edges
between two known independent sets of vertices in $G$ using $O(\log n)$ queries per edge.
This algorithm works without a priori knowledge of $s$, the number of such edges.

Lemma 1. Assume that $S_1$ and $S_2$ are two known, nonempty independent sets of vertices
in $G$. Also assume that $|S_1| \le |S_2|$ and there are $s$ edges between $S_1$ and $S_2$, where
$s > 0$. Then these edges can be identified by a deterministic adaptive algorithm using
no more than $4s(\log |S_2| + 1)$ edge-detecting queries.

Proof. We describe a recursive algorithm whose inputs are the two sets $S_1$ and $S_2$. If
both $S_1$ and $S_2$ are singleton sets, then there is one edge connecting the unique vertex
in $S_1$ to the unique vertex in $S_2$.

If exactly one of $S_1$ and $S_2$ is a singleton, suppose w.l.o.g. it is $S_1$. Divide $S_2$ into
halves $S_{21}$ and $S_{22}$ and query the two sets $S_1 \cup S_{21}$ and $S_1 \cup S_{22}$. For $j = 1, 2$, solve
the problem recursively for $S_1$ and $S_{2j}$ if the query on $S_1 \cup S_{2j}$ is answered 1.

Otherwise, both $S_1$ and $S_2$ contain more than one vertex. Divide each $S_i$ into halves
$S_{i1}$ and $S_{i2}$ and query the four sets $S_{1j} \cup S_{2k}$ for $j = 1, 2$ and $k = 1, 2$. For each query
that is answered 1, solve the problem recursively for $S_{1j}$ and $S_{2k}$.

If we consider the computation tree for this algorithm, the maximum depth does not
exceed $\log |S_2| + 1$ and there are at most $s$ leaves in the tree (corresponding to the $s$ edges
of $G$ that are found). At each internal node of the computation tree, the algorithm asks
at most 4 queries. Therefore, the algorithm asks at most $4s(\log |S_2| + 1)$ queries. □
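
The recursion in this proof can be sketched in Python as follows. This is an illustrative rendering under stated assumptions, not the authors' code: `Q` is assumed to be any callable that answers an edge-detecting query on a collection of vertices, the two inputs are lists representing independent sets with at least one edge between them, and the names `halves` and `find_cross_edges` are hypothetical. The singleton case is handled for either side rather than assuming it is the first set.

```python
def halves(S):
    """Split a list into parts of sizes floor(|S|/2) and ceil(|S|/2)."""
    mid = len(S) // 2
    return S[:mid], S[mid:]


def find_cross_edges(S1, S2, Q):
    """Return all edges between independent sets S1 and S2.

    Assumes the caller already knows there is at least one such edge,
    i.e. a query on the union of S1 and S2 would be answered 1.
    """
    if len(S1) == 1 and len(S2) == 1:
        # Two singletons known to span an edge: that edge is found.
        return [(S1[0], S2[0])]
    # A singleton side is kept whole; a larger side is divided in half.
    parts1 = [S1] if len(S1) == 1 else list(halves(S1))
    parts2 = [S2] if len(S2) == 1 else list(halves(S2))
    found = []
    for A in parts1:
        for B in parts2:
            # At most 4 queries per internal node of the computation tree.
            if Q(set(A) | set(B)):
                found.extend(find_cross_edges(A, B, Q))
    return found
```

Because both inputs are independent sets, a positive answer on the union of two parts can only be caused by an edge running between them, which is what makes the recursive call safe.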

If $S_1$ and $S_2$ are not independent sets in $G$, the problem is more complex because we
must eliminate interference from the edges of $G$ induced by $S_1$ or $S_2$. If we happen to
know the edges of $G$ induced by $S_1$ and $S_2$, and we color the two induced graphs, then
each color class is an independent set in $G$. Then the edges between a color class in $S_1$
and a color class in $S_2$ can be identified using the algorithm in Lemma 1. Because every
edge between $S_1$ and $S_2$ belongs to one such pair, it suffices to consider all such pairs.
The next lemma formalizes this idea.

Lemma 2. For $i = 1, 2$ assume that $S_i$ is a set of vertices that includes $s_i$ edges of $G$,
where $s_1$ and $s_2$ are not both 0, and assume that these edges are known. Also assume
that $|S_1| \le |S_2|$ and there are $s > 0$ edges between $S_1$ and $S_2$. Then these edges can
be identified adaptively using no more than $4(s \log |S_2| + s + s_1 + s_2)$ edge-detecting
queries.
We observe the following fact about vertex coloring.

Fact 1. A graph with $m$ edges can be $\lfloor \sqrt{2m} + 1 \rfloor$-colored. Furthermore, the coloring
can be constructed in polynomial time.

To see this, we successively collapse pairs of vertices not joined by an edge until we
obtain the complete graph on $t$ vertices, which can be $t$-colored and has $t(t-1)/2 \le m$
edges, so that $t \le \sqrt{2m} + 1$. This yields a $t$-coloring of the original graph because
no edge joins vertices that are collapsed into the same final vertex.
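
The collapsing argument can be turned into a short Python sketch. It is a straightforward, unoptimized rendering of the idea, not the authors' implementation; the names `collapse_coloring` and `color_bound` are hypothetical, and the graph is assumed to be given as a vertex list and a list of vertex pairs.

```python
import math
from itertools import combinations


def color_bound(m):
    """floor(sqrt(2m) + 1), the number of colors promised for m edges."""
    return math.isqrt(2 * m) + 1


def collapse_coloring(vertices, edges):
    """Color a graph by repeatedly merging two groups with no edge between them.

    Returns a dict mapping each vertex to a color in 1..t.
    """
    adj = {frozenset(e) for e in edges}
    groups = [[v] for v in vertices]  # each group is an independent set

    def joined(g, h):
        return any(frozenset((a, b)) in adj for a in g for b in h)

    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(groups)), 2):
            if not joined(groups[i], groups[j]):
                # Collapse: the union of two unjoined independent sets
                # is still independent.
                groups[i] = groups[i] + groups[j]
                del groups[j]
                merged = True
                break
    # Now every two groups are joined, so t groups force t(t-1)/2 <= m.
    return {v: c for c, g in enumerate(groups, start=1) for v in g}
```

When the loop stops, the groups form a complete graph after collapsing, so the number of colors used is within `color_bound(m)`.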

Proof (of Lemma 2). Using the preceding Fact 1, for $i = 1, 2$, we may color the subgraph
of $G$ induced by $S_i$ using at most $\lfloor \sqrt{2s_i} + 1 \rfloor$ colors. Each color class is an independent
set in $G$. The edges between $S_1$ and $S_2$ can be divided into the sets of edges between
pairs of color classes from $S_1$ and $S_2$. For each pair of color classes, one from $S_1$ and one
from $S_2$, we query the union of the two classes to determine whether there is any edge of
$G$ between the two classes. If so, then using the algorithm in Lemma 1, we can identify
the edges between the two classes with no more than $4(\log |S_2| + 1)$ queries per edge. To
query the union of each pair of color classes requires at most
$\lfloor \sqrt{2s_1} + 1 \rfloor \cdot \lfloor \sqrt{2s_2} + 1 \rfloor$
queries, which does not exceed $(1 + \sqrt{2})(s_1 + s_2) + 1$. Thus, in total, we use no more
than $4(s \log |S_2| + s + s_1 + s_2)$ edge-detecting queries. □

Now we are able to present our adaptive algorithm to learn a general graph $G = (V, E)$
with $O(\log n)$ queries per edge. One query with the set $V$ suffices to determine whether
$E$ is empty, so we assume that $|E| > 0$.



Algorithm 1 (Adaptive algorithm)

1: If $|V| = 2$, mark the pair of vertices in $V$ as an edge and return.

2: Divide $V$ into halves $S_1$ and $S_2$. Ask $Q(S_1)$ and $Q(S_2)$.

3: Recursively solve the problem for $S_i$ if $Q(S_i) = 1$, for $i = 1, 2$.

4: Use the algorithm in Lemma 2 to identify the edges between $S_1$ and $S_2$.



Proof (of Theorem 1). We give an inductive proof that the algorithm uses no more than
$12m \log n$ edge-detecting queries to learn a graph $G$ with $n$ vertices and $m > 0$ edges.
This clearly holds when $n = 2$. Assume that for some $n \ge 3$, every graph with $n' < n$
vertices and $m' > 0$ edges is learnable with at most $12m' \log n'$ edge-detecting queries.
Assume $S_i$ includes $s_i$ edges of $G$, for $i = 1, 2$. Since $|S_2| \ge |S_1|$, the number of queries
required to learn $G$ is at most

$$(12(s_1 + s_2) + 4(m - s_1 - s_2)) \log |S_2| + 4m + 2$$

using the inductive hypothesis and Lemma 2.

We know that $\log |S_2| \le \log((n + 1)/2) \le \log n - 1/2$ when $n \ge 3$. Then for
$n \ge 3$, the above expression is at most $12m \log n$ because $m \ge 1$. This concludes the
induction. □
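
For readers who want the arithmetic of the inductive step spelled out, one way to write it is the following. This is a sketch that fills in steps not shown in the text, writing $s = m - s_1 - s_2$ for the number of edges between the two halves; the leading 2 counts the two queries on the halves, the next two terms come from the inductive hypothesis, and the last term is the bound of Lemma 2.

```latex
\begin{align*}
\#\text{queries}
  &\le 2 + 12 s_1 \log|S_1| + 12 s_2 \log|S_2|
       + 4\bigl(s \log|S_2| + s + s_1 + s_2\bigr) \\
  &\le \bigl(12(s_1 + s_2) + 4(m - s_1 - s_2)\bigr)\log|S_2| + 4m + 2
       && (|S_1| \le |S_2|) \\
  &\le 12 m \log|S_2| + 4m + 2
       && (s_1 + s_2 \le m) \\
  &\le 12 m \bigl(\log n - \tfrac{1}{2}\bigr) + 4m + 2
       && (n \ge 3) \\
  &= 12 m \log n - 2m + 2 \;\le\; 12 m \log n
       && (m \ge 1).
\end{align*}
```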

This shows that any graph is adaptively learnable using $O(\log n)$ queries per edge. This
algorithm can be parallelized into $O(\log^2 n)$ nonadaptive rounds; in subsequent sections
we develop randomized algorithms that achieve a constant number of rounds.
4 Bounded Degree Graphs

In this section, we present a randomized non-adaptive algorithm to learn any graph with
bounded degree $d$, where we assume that $d = o(n)$ and $d$ is known to the algorithm. The
algorithm uses $O(dn \log n)$ queries and succeeds with probability at least $1 - n^{-c}$. Our
algorithm is a generalization of that of Alon et al. [2] to learn a hidden matching using
$O(n \log n)$ queries. In contrast to their results, we use sampling with replacement and
do not attempt to optimize the constants, as our effort is to map out what is possible in
the general case.

The key observation is that every pair of vertices in $S$ is discovered to be a non-edge
of $G$ if $Q(S) = 0$. The algorithm asks a set of $O(dn \log n)$ queries with random sets of
vertices with the goal of discovering all of the non-edges of $G$.

For a probability $p$, a $p$-random set $P$ is obtained by including each vertex independently
with probability $p$. Each query is an independently chosen $p$-random set. After
all the queries are answered, those pairs of vertices that have not been discovered to be
non-edges are output as edges in $G$. The algorithm may fail by not discovering some
non-edge of $G$, and we bound the probability of failure for an appropriate choice
of $p$ and number of queries.

For a given non-edge $\{u, v\}$ in $G$, the probability that both $u$ and $v$ are included in
a $p$-random set $P$ is $p^2$. Given that $u$ and $v$ are included in $P$, the probability that $P$
has no edge of $G$ is bounded below using the following lemma. Let $N_G(p)$ denote the
probability that a $p$-random set includes no edge of $G$.

Lemma 3. Suppose $I$ is an independent set in $G$, and $\Gamma(I)$ is the set of neighbors of
vertices in $I$. Suppose $P$ is a $p$-random set. Then $\Pr\{Q(P) = 0 \mid I \subseteq P\}$ is at least

$$(1 - p)^{|\Gamma(I)|} \cdot N_G(p).$$

Proof. Let $G'$ be the induced subgraph of $G$ on $V - I - \Gamma(I)$. It is easy to verify
that $N_{G'}(p) \ge N_G(p)$. Independence in the selection of the vertices in $P$ implies that
$\Pr\{Q(P) = 0 \mid I \subseteq P\}$ is the product of the probability that $P$ contains no vertices in
$\Gamma(I)$, which is $(1 - p)^{|\Gamma(I)|}$, and the probability that, given the previous event, $P$ has no
edge of $G$, which is $N_{G'}(p)$. □

By the union bound, we know that $N_G(p) \ge 1 - mp^2$. Also, $|\Gamma(\{u, v\})| \le 2d$ because
the degree of each vertex of $G$ is bounded by $d$. Therefore,

$$\Pr\{Q(P) = 0 \mid u, v \in P\} \ge (1 - p)^{2d}(1 - mp^2) \ge 1 - 2dp - mp^2.$$

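Since $d$ is assumed to be known to the algorithm, we choose $p = 1/\sqrt{dn}$. Then the
above expression is at least $1 - 2\sqrt{d/n} - m/(dn) \ge 1/2 - o(1)$. (Recall that we assume
$d = o(n)$, and note that $m \le dn/2$.) Therefore, the probability that $\{u, v\}$ is shown to be
a non-edge of $G$ by one random query is at least

$$p^2 \left(\frac{1}{2} - o(1)\right) = \frac{1}{dn}\left(\frac{1}{2} - o(1)\right).$$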
The probability that a non-edge $\{u, v\}$ is not discovered to be a non-edge using
$6(1 + o(1)) \cdot dn \ln n$ queries is at most $n^{-3}$ (using Proposition 1). Thus, the probability
that some non-edge of $G$ is not discovered after this many queries is bounded by
$\binom{n}{2} n^{-3} < 1/n$. Note that we can decrease this probability to $n^{-c}$ by asking $c$ times
more queries. Therefore, we have proved the following.

Theorem 3. There is a Monte Carlo non-adaptive algorithm that identifies any graph $G$
drawn from the class of graphs with bounded degree $d$ with probability at least $1 - n^{-c}$
using $O(dn \log n)$ edge-detecting queries, where $n$ is the number of vertices and $c$ is
any constant.

For $d$-regular graphs, this algorithm uses $O(m \log n)$ queries. In particular, for matchings
and Hamiltonian cycles, the algorithm uses $O(n \log n)$ queries.
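
The one-round algorithm of this section can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: vertices are taken to be `0..n-1`, `Q` is any callable answering an edge-detecting query, the sampling probability is $1/\sqrt{dn}$, and the query count `6 * c * d * n * ln n` simply drops the $o(1)$ term from the analysis. The function name and the `c` and `seed` parameters are hypothetical.

```python
import math
import random
from itertools import combinations


def learn_bounded_degree(n, d, Q, c=1.0, seed=None):
    """One-round Monte Carlo learner for graphs of maximum degree d.

    Returns the set of pairs never certified as non-edges. True edges are
    always included; undiscovered non-edges may remain with small probability.
    """
    rng = random.Random(seed)
    p = 1.0 / math.sqrt(d * n)
    num_queries = math.ceil(6 * c * d * n * math.log(n))
    # Non-adaptive: every query set is drawn before any answer is inspected.
    queries = [[v for v in range(n) if rng.random() < p]
               for _ in range(num_queries)]
    pseudo_edges = {frozenset(pair) for pair in combinations(range(n), 2)}
    for P in queries:
        if Q(P) == 0:
            # Q(P) = 0 certifies every pair inside P as a non-edge.
            pseudo_edges.difference_update(
                frozenset(pair) for pair in combinations(P, 2))
    return pseudo_edges
```

The only possible error is one-sided: an edge is never removed, because any set containing both of its endpoints is answered 1.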



5 Constant-Round Algorithms 

The algorithm in the previous section is not query-efficient when G is far from regular, 
e.g. we get a bound of $O(n^2 \log n)$ to learn a star with only $n - 1$ total edges, because
the maximum degree is large. To obtain a query-efficient algorithm for a more general 
class of graphs, we consider constant-round algorithms, in which the set of queries in a 
given round may depend on the answers to queries in preceding rounds. For each round 
of the algorithm, a pseudo-edge is any pair of vertices that has not been discovered to 
be a non-edge of G in any preceding round; this includes all the edges of G and all the 
(as yet) undiscovered non-edges of G. 

In a multiple-round algorithm, there is the option of a final cleanup round, in which
we ask a query for each remaining pseudo-edge, yielding a Las Vegas algorithm instead
of a Monte Carlo algorithm. For example, if we add a cleanup round to the algorithm in
the previous section, we get a 2-round Las Vegas algorithm that always answers correctly
and uses $O(dn \log n)$ queries in expectation.

The algorithm in the previous section assumes $d$ is known. In this section, we first
sketch the intuitions of a 4-round Las Vegas algorithm that learns a general graph using
an expected $O(m \log n + \sqrt{m} \log^2 n)$ queries, assuming $m$ is known. We then develop
a 5-round Las Vegas algorithm that learns a general graph using as many queries without
assuming $m$ is known.

Each vertex of $G$ is classified as a low-degree vertex if its degree does not exceed
$\sqrt{m}$, or a high-degree vertex otherwise. A non-edge of $G$ is a low-degree non-edge if
both vertices in the pair are low-degree vertices.

For the first round we choose the sample probability $p = 1/\sqrt{2m}$. (Recall that we
are assuming $m$ is known in this sketch.) Using Lemma 3, the probability that a particular
low-degree non-edge of $G$ is shown to be a non-edge by a query with a $p$-random set is
at least

$$p^2 \cdot (1 - p)^{2\sqrt{m}} (1 - m \cdot p^2),$$

which is $\Omega(1/m)$. Thus, $O(m \log n)$ queries with $p$-random sets suffice to identify all
the low-degree non-edges of $G$ in the first round with probability at least $1 - n^{-2}$.
Because the number of high-degree vertices is at most $2\sqrt{m}$, we can afford to query
all pairs of them in the cleanup round. We therefore concentrate on non-edges containing
one high-degree and one low-degree vertex. To discover these non-edges, we need a
smaller sampling probability ($p = o(1/\sqrt{m})$), but choosing a sample probability that is
too small runs the risk of requiring too many queries.

The right choice of a sampling probability p differs with the degree of each individual 
high-degree vertex, so in the second round we estimate such p’s. In the third round, we 
use the estimated p’s to identify non-edges containing a high-degree and a low-degree 
vertex. In the cleanup round we ask queries on every remaining pseudo-edge. In fact, 
since the actual degrees of the vertices are not known, the sets of high-degree and low- 
degree vertices must be approximated. 

The above sketches the intuitions for a 4-round algorithm when m is known. If m 
is unknown, one plausible idea would be to try to estimate m sufficiently accurately by 
random sampling in the first round, and then proceed with the algorithm sketched above. 
This idea does not seem to work, but analyzing it motivates the development of our final
5-round algorithm.

First we have the following “obvious” lemma: as we increase the sampling probability
$p$, we are more likely to include an edge of $G$ in a $p$-random set. It can be proved by
expressing $N_G(p)$ as a sum over all independent sets in $G$, grouped by their sizes, and
differentiating with respect to $p$.

Lemma 4. Assuming $m > 0$, $N_G(p)$ is strictly decreasing as $p$ increases.

It follows that since $N_G(0) = 1$ and $N_G(1) = 0$, there exists a unique $p_*(G)$ such
that $N_G(p_*(G)) = 1/2$. In other words, $p_*(G)$ is the sampling probability $p$ that makes
an edge-detecting query with a $p$-random set equally likely to return 0 or 1, maximizing
the information content of such queries.

It is plausible to think that $p_*(G)$ will reveal much about $m$. However, $p_*(G)$ also
depends strongly on the topology of $G$. Consider the following two graphs: $G_M$, a
matching with $m$ edges, and $G_S$, a star with $m$ edges. We have

$$N_{G_M}(p) = (1 - p^2)^m$$

$$N_{G_S}(p) = 1 - p + p(1 - p)^m$$



Therefore, we have $p_*(G_M) = O(1/\sqrt{m})$ but $p_*(G_S) \ge 1/2$. We believe that such
a gap in the $p_*(G)$'s of two different topologies lies behind the difficulty of estimating $m$ in
one round.
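
The gap between the two topologies is easy to see numerically. The Python sketch below evaluates the two closed forms, $(1 - p^2)^m$ for a matching and $1 - p + p(1 - p)^m$ for a star, and locates the point where each equals $1/2$ by bisection, which is valid because the function is strictly decreasing in $p$. The function names and the choice of bisection are illustrative assumptions.

```python
def n_matching(p, m):
    """Probability that a p-random set contains no edge of an m-edge matching."""
    return (1 - p * p) ** m


def n_star(p, m):
    """Probability that a p-random set contains no edge of an m-edge star."""
    return 1 - p + p * (1 - p) ** m


def p_half(N, lo=0.0, hi=1.0, iters=60):
    """Bisection for the p with N(p) = 1/2, assuming N decreases from 1 to 0."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if N(mid) > 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For example, `p_half(lambda p: n_matching(p, 100))` is on the order of $1/\sqrt{m}$, while `p_half(lambda p: n_star(p, 100))` stays at about $1/2$ however large $m$ is.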

Although our effort to estimate $m$ has been thwarted, $p_*(G)$ turns out to be the
sampling probability that will help us identify most of the non-edges in the graph. We
will use $N(p)$ instead of $N_G(p)$ and $p_*$ instead of $p_*(G)$ when the choice of $G$ is clear
from the context.

First, we have rough upper and lower bounds for $p_*$:

$$\frac{1}{\sqrt{2m}} \le p_* \le \frac{\sqrt{2}}{2},$$

observing that $1 - mp^2 \le N(p) \le 1 - p^2$. The fact that $p_*$ helps us identify most of
the non-edges is made clear in the following two lemmas.
Lemma 5. Let $\{u, v\}$ be a non-edge of $G$ in which the degrees of $u$ and $v$ do not exceed
$2/p_*$. Then a query on a $p_*$-random set identifies $\{u, v\}$ as a non-edge with probability
$\Omega(1/m)$.
Proof. According to Lemma 3, the probability that the non-edge $\{u, v\}$ is identified by
a query on a $p_*$-random set is at least

$$p_*^2 (1 - p_*)^{4/p_*} N(p_*).$$

We know that $p_* \le \sqrt{2}/2$. According to Proposition 2, $(1 - p_*)^{4/p_*} = \Omega(1)$. Combining
this with the facts that $p_* \ge 1/\sqrt{2m}$ and $N(p_*) = 1/2$, we have that the probability is
$\Omega(1/m)$. □

Examining the proof of Lemma 5 we can see that rather than requiring the sampling
probability $p$ to be exactly $p_*$, it is sufficient to require upper and lower bounds as follows:
$p = \Omega(1/\sqrt{m})$ and $p \le p_*$.

Corollary 1. We can identify with probability $\Omega(1/m)$ any non-edge with the
degrees of both ends no more than $2/p_*$ by a query on a $p$-random set, where
$p = \Omega(1/\sqrt{m})$ and $p \le p_*$.



Lemma 6. There are at most $1/p_*$ vertices that have degree more than $2/p_*$.

Proof. Suppose that there are $h$ vertices that have degree more than $2/p_*$. Let $P$ be a
$p_*$-random set. Given that one of the $h$ vertices is included in $P$, the probability that $P$
has no edge in $G$ is at most $(1 - p_*)^{2/p_*} \le 1/e^2$. The probability that $P$ contains none
of the $h$ vertices is at most $(1 - p_*)^h$. Therefore, the probability that $P$ has no edge of $G$ is
at most

$$(1 - (1 - p_*)^h) \cdot \frac{1}{e^2} + (1 - p_*)^h \cdot 1 \le \frac{1}{e^2}\left(1 + (e^2 - 1) \cdot e^{-p_* h}\right),$$

which should be no less than $1/2$. Thus we have $e^{-p_* h} \ge 1/e$. Therefore $h \le 1/p_*$. □

Recalling that $p_* \ge 1/\sqrt{2m}$, we have the following.

Corollary 2. There are at most $O(\sqrt{m})$ vertices that have degrees more than $2/p_*$.

The 5-round algorithm is shown in Algorithm 2. Its correctness is guaranteed by the
cleanup round, so our task is to bound the expected number of queries. For this analysis,
we call a vertex a low-degree vertex if its degree is at most $2/p_*$ and call it a high-degree
vertex otherwise. The non-edges consisting of two low-degree vertices are called
low-degree non-edges. In the following, we will show that each round will succeed with
probability at least $1 - n^{-2}$ given that the previous rounds succeed.

First we show that with high probability $p'$ exists and satisfies our requirement for
the second round.
Algorithm 2 (The 5-round algorithm)

1. (Estimate $p_*$) Let $p_i = 2^i/n$ for $i = 0, \ldots, \lfloor \log n \rfloor$. For each $i$, choose and query a
$p_i$-random set $O(\log n)$ times. Let the average outcome of edge-detecting queries on $p_i$-random
sets be $1 - N_i$. Let $p' = (1/2) \min\{p_i \mid N_i \le 5/8\}$. Go to the 5th round if $p'$ doesn't exist.

2. (Low-degree edges) Choose and query a $p'$-random set $O((1/p'^2) \log n)$ times.

3. (Degree estimation of high-degree vertices) Let $E'$ be the set of pseudo-edges after the
second round. Let $G' = (V, E')$.

a) Divide $V$ into two sets $H$ and $L$ according to their degrees in $G'$. $L$ contains the vertices
that have degrees at most $3/p'$ and $H$ contains the rest of the vertices.

b) For each vertex $u$ in $H$ and each $p_i$, query $O(\log n)$ times the union of $\{u\}$ and a
$p_i$-random set.

c) Let $1 - N_i^u$ be the average outcome of random queries with probability $p_i$ on vertex $u$.
Let $p_u = \max\{p_i \mid N_i^u \ge 1/5 + 1/(2e)\}$ if $p_u$ exists.

4. (Edges between high-degree and low-degree vertices) For each vertex $u \in H$ such that $p_u$
exists, query the union of $\{u\}$ and a $p_u$-random set $O((1/p_u) \log n)$ times.

5. (Cleanup) Query every remaining pseudo-edge.
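
The first round of Algorithm 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: vertices are `0..n-1`, `Q` is any callable answering an edge-detecting query, and the hypothetical parameter `reps` stands in for the logarithmic number of repetitions, whose constant the text does not fix. A fresh random set is drawn for each repetition.

```python
import random


def estimate_p_prime(n, Q, reps, seed=None):
    """Round 1: return p' = (1/2) * min{p_i : N_i <= 5/8}, or None if no such p_i."""
    rng = random.Random(seed)
    candidates = []
    for i in range(n.bit_length()):  # i = 0, ..., floor(log2 n)
        p_i = 2 ** i / n
        zeros = 0
        for _ in range(reps):
            P = [v for v in range(n) if rng.random() < p_i]
            zeros += 1 - Q(P)  # count the queries answered 0
        N_i = zeros / reps  # empirical estimate of N(p_i)
        if N_i <= 5 / 8:
            candidates.append(p_i)
    return min(candidates) / 2 if candidates else None
```

All candidate probabilities are sampled regardless of earlier answers, so the round remains non-adaptive; only the final selection of the estimate looks at the outcomes.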



Lemma 7. $p' \le p_*$ and $p' = \Omega(1/\sqrt{m})$ with probability at least $1 - n^{-2}$.

Proof. Let $p_j = \min\{p_i \mid p_i \ge p_*\}$. Obviously $p_j$ exists and we have $p_j \le 2p_*$. First we
observe that with high probability

$$p' \le \frac{1}{2} p_j \le p_*.$$

The probability that the above inequality is violated is $\Pr[p' > (1/2)p_j] \le \Pr[N_j > 5/8]$.
We know that $N(p_j) \le N(p_*) = 1/2$. According to Hoeffding's inequality [5],
we can make the probability at most $1/(2n^2)$ by asking $\Theta(\log n)$ queries.

Also by Hoeffding's inequality, we have $\Pr[N(2p') > 3/4] \le 1/(2n^2)$. Therefore,
we have $1 - m(2p')^2 \le N(2p') \le 3/4$, and hence $p' \ge 1/(4\sqrt{m})$. Thus with probability
at least $1 - n^{-2}$ we have $1/(4\sqrt{m}) \le p' \le p_*$. □

Using Corollary 1, we can conclude that if the above inequalities are true, by asking
$O(m \log n)$ queries, we can guarantee with probability at least $1 - n^{-4}$ that a given
low-degree non-edge is identified in the second round. So we can guarantee with probability
at least $1 - n^{-2}$ that every low-degree non-edge is identified in the second round.

Suppose that we identify all of the low-degree non-edges in the second round. All the 
low-degree vertices must fall into L, since their degrees in G' are at most 3/p, (which 
is at most \H\ < l/p* more than their true degrees). 

However, L may also contain some high-degree vertices. At most 1 /p* high-degree 
vertices fall into L, and their degrees are bounded by 3/p'. Note that both 1 /p' and 1 /p* 
are O(y'm) . The total number of pseudo-edges incident with high-degree vertices in L is 
therefore bounded by 0(m) . Also, the number of pseudo-edges between pairs of vertices 
in H is bounded by 0{m). As stated before, they can be identified in the cleanup round 
with 0{m) queries. We will therefore analyze only the behavior of non-edges between 
vertices in $H$ and low-degree vertices in $L$ in the third and fourth round. 

We will now show that $p_u$ is what we want for each vertex $u \in H$. Let $d_u$ denote the 
degree of vertex $u$. 

Lemma 8. For each u € H, 1/ (10<i„) < Pu ^ l/d„ with probability at least 1 — n~^, 
given that the algorithm succeeds in the first and second rounds. 

Proof. Denote by $N^u(p)$ the probability that the union of $\{u\}$ and a $p$-random set 
has no edge. According to Hoeffding's inequality, by asking $\Theta(\log n)$ queries we can 
make 7V"(p„) > 1/e true with probability at least 1 — (l/3)n“^. Note that iV“(p„) < 
(1 — Pu)‘^'^ < Thus we can conclude thatp„ < l/(i„ is true with probability at 

least 1 — (l/3)n“^. 

Assume p“ = max{pi\N^{pi) > 2/5}. First we observe that with high probability 
thatp„ > p“. The probability this inequality is violated is 

Pr[Pu > P]] > Pr[N’^{pj) < 1/5 + l/2e] 

By Hoeffding’s inequality, the probability can be made no more than ( 1 /3)n“^ by asking 
6>(logn) queries. 

According to our choice of p“, we have N'^{pj_^_i) < 2/5. By Lemma 3 we know 
that 

iV“(p“+i) > (1 - 2p“)"“ ■ iV(2p“) 

As we just showed, p„ > p" is true with probability at least 1 — (l/3)n“^. Therefore, 
with probability at least 1 — (l/3)n“^ 

I > iV“(p“+i) > (1 - 2p„)"“ ■ N(2p„) 

> (1 - 2p„d„) • N{2pfi) 

Since we already showed that p„ < 1 /(i„ is true with probability at least 1— (l/3)n“^ 
and we know that Mu G H,du > 2/p*, we have 

iV(2p„) > iV(|-) > iV(p*) > i 

is true with probability at least 1 — (l/3)n“^. Thus we can conclude thatp„ > l/(10<i„) 
with probability at least 1 — (2/3)n“^. □ 

In the third round, we can guarantee that l/(10(i„) < p„ < l/ti„ is true for every 
u G H with probability at least 1 — n“^. 

Let us assume the above inequality is true for every $u \in H$. Suppose $v$ is a low-degree 
vertex and $\{u, v\}$ is a non-edge. Let $P$ be a $p_u$-random set. 

Pr{Q{P U {u, u|) = 0} > (1 - p„)‘^“+2/P* • N(p„) 

Since p„ < l/(i„ < p*/2, we have both (1 — = 17(1) and A^(p„) = 17(1). 

The probability that we choose $v$ in one random query is $p_u$, which is $\Omega(1/d_u)$. Therefore, 
the probability that $\{u, v\}$ is identified in one random query concerning $u$ is $\Omega(1/d_u)$. By 
querying the union of $\{u\}$ and a $p_u$-random set $O(d_u \log n)$ times, we can guarantee 
that {u, w} is identified as a non-edge with prohahility at least 1 — Therefore, given 
that rounds one, two and three succeed, round four identifies every non-edge {w, f } with 
u G H and v a low degree vertex, with probability at least 1 — 

Given that the algorithm succeeds in rounds one through four, the only pseudo-edges 
that remain are either edges of G or non-edges between pairs of vertices in H or 
non-edges incident with the high degree vertices in L. As shown above, the total number 
of such non-edges is 0{m). 

Finally, we bound the expected number of queries used by the algorithm. It is clear 
that in the event that each round succeeds, the first round uses $O(\log^2 n)$ queries; the 
second round uses $O(m \log n)$ queries; the third round uses $O(\sqrt{m} \log^2 n)$ queries; the 
fourth round uses $O(\sum_{u \in H} d_u \log n) = O(m \log n)$ queries; the fifth round uses $O(m)$ 
queries. The probability that each round fails is bounded by The maximum number 
of queries used in case of failures is $O(n^2 \log n)$. Therefore in expectation the algorithm 
uses $O(m \log n + \sqrt{m} \log^2 n)$ queries. Note that this bound is $O(m \log n)$ if $m$ is 
$\Omega(\log^2 n)$. 

Therefore, we have the following theorem. 

Theorem 4. There is a Las Vegas 5-round algorithm that identifies any graph $G$ drawn 
from the class of all graphs with $n$ vertices and $m$ edges using $O(m \log n + \sqrt{m} \log^2 n)$ 
edge-detecting queries in expectation. 



6 Hypergraph Learning 

In this section, we consider the problem of learning hypergraphs with edge-detecting 
queries. An edge-detecting query $Q_H(S)$, where $H$ is a hypergraph, is answered 1 or 0 
indicating whether $S$ contains all vertices of at least one hyperedge of $H$ or not. The 
information-theoretic lower bound implies that any algorithm takes at least $\Omega(rm \log n)$ 
queries to learn hypergraphs of dimension $r$ with $m$ edges. We show that no algorithm 
can learn hypergraphs of dimension $r$ with $m$ edges using $o((2m/r)^{r/2})$ queries if we 
allow the hypergraph to be non-uniform, even if we allow randomness. When $m$ is large, 
say $\omega(r \log^2 n)$, this implies that there is no algorithm using only $O(r \log n)$ queries per 
edge when $r \ge 3$. 

For uniform hypergraphs, we show that the algorithm in Section 4 for graphs can be 
generalized to sparse hypergraphs. However, the sparsity requirement for hypergraphs is 
more severe. Recall that we assume $d = o(n)$ in Section 4. For hypergraphs, we require 

$$d \le \frac{n^{1/(r-1)}}{2r^{1+2/(r-1)}}.$$



Theorem 5. $\Omega((2m/r)^{r/2})$ edge-detecting queries are required to identify a hypergraph 
$H$ drawn from the class of all hypergraphs of dimension $r$ with $n$ vertices and $m$ 
edges. 

Proof. We generalize the lower bound argument from [6] for learning monotone DNF 
formulas using membership queries. Let $r$ and $k$ be integers greater than 1. Let $V_1, \ldots, V_r$ 
be pairwise disjoint sets containing $k$ vertices each. For $1 \le i \le r$ let $E_i = \{(u, v) \mid u, v \in V_i, u \ne v\}$. 
Thus, $E_i$ is a clique of 2-edges on the vertices $V_i$. Consider a hypergraph $H$ 
with vertices $V$ including each $V_i$ and edges 

$$E = \bigcup_{i=1}^{r} E_i \cup \{(w_1, \ldots, w_r)\},$$

where $w_i \in V_i$ for $1 \le i \le r$. There are $k^r$ such hypergraphs, one for each choice of an 
r-edge. 

Even knowing the form of the hypergraph and the identity of the sets of vertices $V_i$, 
the learning algorithm must ask at least $k^r - 1$ queries if the adversary is adaptive. Every 
query that contains more than one vertex from some $V_i$ is answered 1; therefore, only 
queries that contain exactly one vertex from each $V_i$ yield any information about the 
r-edge characterizing $H$. An adversary may maintain a set $R \subseteq V_1 \times \cdots \times V_r$ consisting 
of the r-edges not queried so far. Each query with an r-edge may be answered 0 until 
$|R| = 1$, which means that the learning algorithm must make at least $k^r - 1$ queries to 
learn $H$. In terms of $m$, this is $\Omega((2m/r)^{r/2})$. 

Even if the adversary is constrained to make a random choice of an r-edge $T$ at 
the start of the algorithm and answer consistently with it, we show that $\Omega((2m/r)^{r/2})$ 
queries are necessary. Suppose $S_1, S_2, \ldots, S_q$ is the sequence of r-edges a randomized 
algorithm makes queries on. It is easy to see that $\Pr\{S_1 = T\} = 1/k^r$. We also 
have $\Pr\{S_{i+1} = T \mid S_j \ne T, j \le i\} \le 1/(k^r - i)$ since each r-edge is equally likely 
to be $T$. Therefore, the probability that none of the $S_i$'s equals $T$ is at least $(k^r - q)/k^r$. 
When $q \le k^r/2$, this is at least 1/2. □ 

We now present a randomized non-adaptive algorithm for r-uniform hypergraphs with 
bounded degree $d$, generalizing the algorithm for degree bounded graphs in Section 4. 
The algorithm uses $O(dn \log n)$ queries and succeeds with probability at least $1 - n^{-c}$ 
assuming $d$ is known and $d \le n^{1/(r-1)}/(2r^{1+2/(r-1)})$. The algorithm asks queries on 
independently chosen $p$-random sets. Let $P$ be a $p$-random set. Let $w$ be a non-edge of $H$. Thus 
$\Pr\{w \subseteq P\} = p^r$. Consider the set $E'$ of hyperedges that have nonempty intersection 
with $w$. By uniformity, each such hyperedge contains a vertex that is not in $w$. Let $L$ be 
a set that contains one such vertex from each hyperedge in $E'$. Thus $|L| \le |E'| \le dr$. 
The probability that $P$ includes no edge in $E'$ given that $w \subseteq P$ is at least 
$(1 - p)^{|L|} \ge (1 - p)^{dr}$. Let $H'$ be the induced hypergraph on $V - L - w$. Since $H'$ has at most $m$ 
edges, the probability that $P$ contains no edge in $H'$ is at least $1 - mp^r$. Therefore, we have 

$$\Pr\{Q_H(P) = 0 \text{ and } w \subseteq P\} \ge p^r (1 - p)^{dr} (1 - mp^r) \ge p^r (1 - drp)(1 - mp^r).$$

Choose $p = 1/(2dn/r)^{1/r}$. Since $mr \le dn$, $1 - mp^r \ge 1/2$. When $d \le n^{1/(r-1)}/(2r^{1+2/(r-1)})$, 
$1 - drp \ge 1/2$. Therefore, the above probability is at least 
$r/(8dn)$. The probability that $w$ is not discovered to be a non-edge after $8dn(r+1)/r \cdot \ln n$ 
queries is at most $n^{-(r+1)}$. Since there are at most $n^r$ non-edges, the probability that some 
non-edge in $H$ is not discovered after this many queries is bounded by $1/n$. We can decrease 
this probability to $n^{-c}$ by asking $c$ times more queries. 

Theorem 6. There is a Monte Carlo non-adaptive algorithm that identifies any 
r-uniform hypergraph drawn from the class of all r-uniform hypergraphs with bounded degree $d$, 
where $d \le n^{1/(r-1)}/(2r^{1+2/(r-1)})$, with probability at least $1 - n^{-c}$ using $O(dn \log n)$ queries, 
where $n$ is the number of vertices and $c$ is some constant. 
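
The non-adaptive procedure analyzed above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the brute-force enumeration of all r-subsets as candidates, and the toy instance are hypothetical and chosen only to make the idea concrete.

```python
import random
from itertools import combinations


def edge_query(hyperedges, subset):
    """Edge-detecting query Q_H(S): 1 if S contains all vertices of some hyperedge."""
    return int(any(edge <= subset for edge in hyperedges))


def random_set(n, p, rng):
    """A p-random set: each vertex is included independently with probability p."""
    return frozenset(v for v in range(n) if rng.random() < p)


def learn_nonadaptive(n, r, d, oracle, num_queries, rng):
    """Return the r-subsets that no query answered 0 has ruled out.

    All query sets are drawn before any answer is seen (non-adaptive),
    with p = (r / (2 d n))^(1/r) as chosen in the text.
    """
    p = (r / (2.0 * d * n)) ** (1.0 / r)
    queries = [random_set(n, p, rng) for _ in range(num_queries)]
    zero_sets = [q for q in queries if oracle(q) == 0]
    candidates = set()
    for combo in combinations(range(n), r):
        w = frozenset(combo)
        # w is a proven non-edge if it lies inside some set answered 0.
        if not any(w <= q for q in zero_sets):
            candidates.add(w)
    return candidates


# Hypothetical toy instance: a 3-uniform hypergraph on 8 vertices with degree 1.
true_edges = {frozenset({0, 1, 2}), frozenset({3, 4, 5})}
rng = random.Random(0)
found = learn_nonadaptive(8, 3, 1, lambda s: edge_query(true_edges, s), 2000, rng)
```

The toy parameters are too small to satisfy the sparsity requirement on $d$, so the sketch relies only on the one-sided guarantee: a true hyperedge is never eliminated, because any query set containing it is answered 1.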



7 Open Problems 

We leave the following problems open. Reduce the number of queries needed for 
Algorithm 2 from $O(m \log n + \sqrt{m} \log^2 n)$ to $O(m \log n)$. Reduce the number of rounds of 
Algorithm 2 without substantially increasing the number of queries. Find an algorithm 
that learns the class of r-uniform hypergraphs with $m$ edges using $O(rm \log n)$ queries 
or show it is impossible. 

Abstract. We consider two well-studied problems regarding attribute 
efficient learning: learning decision lists and learning parity functions. 
First, we give an algorithm for learning decision lists of length $k$ over 
$n$ variables using $2^{\tilde{O}(k^{1/3})} \log n$ examples and time $n^{\tilde{O}(k^{1/3})}$. This is the 
first algorithm for learning decision lists that has both subexponential 
sample complexity and subexponential running time in the relevant pa- 
rameters. Our approach is based on a new construction of low degree, 
low weight polynomial threshold functions for decision lists. For a wide 
range of parameters our construction matches a lower bound due to 
Beigel for decision lists and gives an essentially optimal tradeoff between 
polynomial threshold function degree and weight. 

Second, we give an algorithm for learning an unknown parity function 
on $k$ out of $n$ variables using $O(n^{1-1/k})$ examples in poly($n$) time. For 
$k = o(\log n)$ this yields the first polynomial time algorithm for learning 
parity on a superconstant number of variables with sublinear sample 
complexity. We also give a simple algorithm for learning an unknown 
size-$k$ parity using $O(k \log n)$ examples in $n^{k/2}$ time, which improves on 
the naive $n^k$ time bound of exhaustive search. 



1 Introduction 

An important goal in machine learning theory is to design attribute efficient 
algorithms for learning various classes of Boolean functions. A class $C$ of Boolean 
functions over $n$ variables $x_1, \ldots, x_n$ is said to be attribute-efficiently learnable 
if there is a poly($n$) time algorithm which can learn any function $f \in C$ using 
a number of examples which is polynomial in the "size" (description length) 
of the function $f$ to be learned, rather than in $n$, the number of features in 
the domain over which learning takes place. (Note that the running time of 
the learning algorithm must in general be at least $n$ since each example is an 
$n$-bit vector.) Thus an attribute efficient learning algorithm for e.g. the class of 
Boolean conjunctions must be able to learn any Boolean conjunction of $k$ literals 
over $x_1, \ldots, x_n$ using poly($k$, $\log n$) examples, since $k \log n$ bits are required to 
specify such a conjunction. 

A longstanding open problem in machine learning, posed first by Blum in 
1990 [4,5, 7, 8] and again by Valiant in 1998 [33], is whether or not there exist 
attribute efficient algorithms for learning decision lists, which are essentially 
nested “if-then-else” statements (we give a precise definition in Section 2) . One 
motivation for considering the problem comes from the infinite attribute model 
introduced in [4] . Blum et al. [7] showed that for many concept classes (including 
decision lists) attribute efficient learnability in the standard n-attribute model is 
equivalent to learnability in the infinite attribute model. Since simple classes such 
as disjunctions and conjunctions are attribute efficiently learnable (and hence 
learnable in the infinite attribute model) , this motivated Blum [4] to ask whether 
the richer class of decision lists is thus learnable as well. Several researchers [5,8, 
10,26,29] have since considered this problem; we summarize this previous work 
in Section 1.2. More recently, Valiant [33] relates the problem of learning decision 
lists attribute efficiently to questions about human learning abilities. 

Another outstanding challenge in machine learning is to determine whether 
there exist attribute efficient algorithms for learning parity functions. The parity 
function on a set of 0/1-valued variables $x_{i_1}, \ldots, x_{i_k}$ takes value $+1$ or $-1$ 
depending on whether $x_{i_1} + \cdots + x_{i_k}$ is even or odd. As with decision lists, a 
simple PAC learning algorithm is known for the class of parity functions but no 
attribute efficient algorithm is known. 

1.1 Our Results 

We give the first learning algorithm for decision lists that is subexponential in 
both sample complexity (in the relevant parameters k and log n) and running 
time (in the relevant parameter k) . Our results demonstrate for the first time that 
it is possible to simultaneously avoid the “worst case” in both sample complexity 
and running time, and thus suggest that it may perhaps be possible to learn 
decision lists attribute efficiently. Our main learning result for decision lists is: 

Theorem 1. There is an algorithm which learns length-$k$ decision lists over 
$\{0,1\}^n$ with mistake bound $2^{\tilde{O}(k^{1/3})} \log n$ and time $n^{\tilde{O}(k^{1/3})}$. 

This bound improves on the sample complexity of Littlestone's well-known 
Winnow algorithm [21] for all $k$ and improves on its runtime as well for 
$k = \Omega(\log^{3/2} n)$; see Section 1.2. 

We prove Theorem 1 in two parts; first we generalize the Winnow algorithm 
for learning linear threshold functions to learn polynomial threshold functions 
(PTFs). In recent work on learning DNF formulas [18], intersections of halfspaces 
[17], and Boolean formulas of superconstant depth [27], PTFs of degree $d$ 
have been learned in time $n^{O(d)}$ by using polynomial time linear programming 
algorithms such as the Ellipsoid algorithm (see e.g. [18]). In contrast, since we 
want to achieve low sample complexity as well as an $n^{O(d)}$ runtime, we use a 
generalization of the Winnow algorithm to learn PTFs. This generalization has 
sample complexity and running time bounds which depend on the degree and 
the total magnitude of the integer coefficients (i.e. the weight) of the PTF: 

Theorem 2. Let $C$ be a class of Boolean functions over $\{0,1\}^n$ with the property 
that each $f \in C$ has a PTF of degree at most $d$ and weight at most $W$. Then 
there is an online learning algorithm for $C$ which runs in $n^d$ time per example 
and has mistake bound $O(W^2 \cdot d \cdot \log n)$. 

This reduces the decision list learning problem to a problem of representing 
decision lists with PTFs of low weight and low degree. To this end we prove: 

Theorem 3. Let $L$ be a decision list of length $k$. Then $L$ is computed by a 
polynomial threshold function of degree $\tilde{O}(k^{1/3})$ and weight $2^{\tilde{O}(k^{1/3})}$. 

Theorem 1 follows directly from Theorems 2 and 3. We emphasize that Theorem 3 
does not follow from previous results [18] on representing DNF formulas 
as PTFs; the PTF construction from [18] in fact has exponentially larger weight 

(2^°*' * rather than 2®^^ ^^^) than the construction in this paper. 

Our PTF construction is essentially optimal in the tradeoff between degree 
and weight which it achieves. In 1994 Beigel [3] gave a lower bound showing that 
any degree d PTF for a certain decision list must have weight \ ^ For 

d = Beigel’s lower bound implies that our construction in Theorem 3 is 

essentially the best possible. 

For parity functions, we give an 0{n‘^) time algorithm which can PAG learn 
an unknown parity on k variables out of n using examples. To our 

knowledge this is the first algorithm for learning parity on a superconstant num- 
ber of variables with sublinear sample complexity. Our algorithm works by 
finding a “low weight” solution to a system of m linear equations (correspond- 
ing to a set of m examples). We prove that with high probability we can find 
a solution of weight $O(n^{1-1/k})$ irrespective of $m$. Thus by taking $m$ to be only 
slightly larger than $n^{1-1/k}$, standard arguments show that our solution is a good 
hypothesis. 

We also describe a simple algorithm, due to Dan Spielman, for learning an 
unknown parity on $k$ variables using $O(k \log n)$ examples and $\tilde{O}(n^{k/2})$ time. This 
gives a square root runtime improvement over a naive $O(n^k)$ exhaustive search. 



1.2 Previous Results 

In previous work several algorithms with different performance bounds (runtime 
and sample complexity) have been given for learning length-$k$ decision lists. 

^ Krause [20] claims a lower bound of degree d and weight for a particular 

decision list; this claim, however, is in error. 

- Rivest [28] gave the first algorithm for learning decision lists in Valiant's 
PAC model of learning from random examples. Littlestone [5] later gave an 
analogue of Rivest's algorithm in the online learning model. The algorithm 
can learn any decision list of length $k$ in $O(kn^2)$ time using $O(kn)$ examples. 

- A brute-force approach is to maintain the set of all length-$k$ decision lists 
which are consistent with the examples seen so far, and to predict at each 
stage using majority vote over the surviving hypotheses. This "halving 
algorithm" (proposed in various forms in [1,2,24]) can learn decision lists of 
length $k$ using only $O(k \log n)$ examples, but the running time is $n^{O(k)}$. 

- Several researchers [5,33] have observed that Winnow can learn length-$k$ 
decision lists from $2^{O(k)} \log n$ examples in time $2^{O(k)} n \log n$. This follows 
from the fact that any decision list of length $k$ can be expressed as a linear 
threshold function with integer coefficients of magnitude $2^{\Theta(k)}$. 

- Finally, several researchers have considered the special case of learning a 
length-$k$ decision list in which the output bits of the list have at most $D$ 
alternations. Valiant [33] and Nevo and El-Yaniv [26] have given refined 
analyses of Winnow's performance for this case (see also Dhagat and 
Hellerstein [10]). However, for the general case where $D$ can be as large as $k$, these 
results do not improve on the standard Winnow analysis described above. 

Note that all of these earlier algorithms have an exponential dependence on the 
relevant parameter(s) ($k$ and $\log n$ for sample complexity, $k$ for running time) 
for either the running time or the sample complexity. 

Little previous work has been published on learning parity functions attribute 
efficiently in the PAC model. The standard PAC learning algorithm for parity 
(based on solving a system of linear equations) is due to Helmbold et al. [15]; 
however this algorithm is not attribute efficient since it uses $\Omega(n)$ examples 
regardless of $k$. Several authors have considered learning parity attribute 
efficiently in a model where the learner is allowed to make membership queries. 
Attribute efficient learning is easier in this framework since membership queries 
can help identify relevant variables. Blum et al. [7] give a randomized polynomial 
time membership-query algorithm for learning parity on $k$ variables using only 
$O(k \log n)$ examples, and these results were later refined by Uehara et al. [32]. 

In Section 2 we give necessary background. In Section 3 we show how to 
reduce the decision list learning problem to a problem of finding suitable PTF 
representations of decision lists (Theorem 2). In Section 4 we give our PTF con- 
struction for decision lists (Theorem 3). In Section 5 we discuss the connection 
between Theorem 3 and Beigel’s ODDMAXBIT lower bound. In Section 6 we 
give our results on learning parity functions, and we conclude in Section 7. 

2 Preliminaries 

Attribute efficient learning has been chiefly studied in the on-line mistake-bound 
model of concept learning which was introduced in [21,23]. In this model learning 
proceeds in a series of trials, where in each trial the learner is given an unlabelled 
Boolean example $x \in \{0,1\}^n$ and must predict the value $f(x)$ of the unknown 







A.R. Klivans and R.A. Servedio 



target function $f$. After each prediction the learner is given the true value of $f(x)$ 
and can update its hypothesis before the next trial begins. The mistake bound of 
a learning algorithm on a target concept $c$ is the worst-case number of mistakes 
that the algorithm makes over all (possibly infinite) sequences of examples, and 
the mistake bound of a learning algorithm on a concept class (class of Boolean 
functions) $C$ is the worst-case mistake bound across all functions $f \in C$. The 
running time of a learning algorithm $A$ for a concept class $C$ is defined as the 
product of the mistake bound of $A$ on $C$ and the maximum running time 
required by $A$ to evaluate its hypothesis and update its hypothesis in any trial. 

Our main interests are the classes of decision lists and parity functions. A 
decision list $L$ of length $k$ over the Boolean variables $x_1, \ldots, x_n$ is represented 
by a list of $k$ pairs and a bit $(\ell_1, b_1), (\ell_2, b_2), \ldots, (\ell_k, b_k), b_{k+1}$ where each $\ell_i$ is a 
literal and each $b_i$ is either $-1$ or $1$. Given any $x \in \{0,1\}^n$, the value of $L(x)$ 
is $b_i$ if $i$ is the smallest index such that $\ell_i$ is made true by $x$; if no $\ell_i$ is true 
then $L(x) = b_{k+1}$. A parity function of length $k$ is defined by a set of variables 
$S \subseteq \{x_1, \ldots, x_n\}$ such that $|S| = k$. The parity function $\chi_S(x)$ takes value $1$ 
($-1$) on inputs which set an even (odd) number of variables in $S$ to 1. 
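
The two definitions above can be made concrete with a short sketch. The representation (0-based variable indices, a `(index, negated)` pair for each literal) and the function names are assumptions made for this illustration only.

```python
def eval_decision_list(pairs, default, x):
    """Evaluate (l_1, b_1), ..., (l_k, b_k), b_{k+1} on the 0/1 vector x.

    Each literal is (index, negated); `default` plays the role of b_{k+1}.
    """
    for (index, negated), output in pairs:
        literal_true = (x[index] == 0) if negated else (x[index] == 1)
        if literal_true:
            return output  # the first satisfied literal decides the value
    return default


def eval_parity(S, x):
    """Parity chi_S(x): +1 if an even number of variables in S are 1, else -1."""
    return 1 if sum(x[i] for i in S) % 2 == 0 else -1


# Hypothetical list: "if x_0 then +1, else if not x_2 then -1, else +1".
example_list = [((0, False), 1), ((2, True), -1)]
```

For instance, `eval_decision_list(example_list, 1, (0, 0, 0))` falls through the first pair and is decided by the negated literal on `x[2]`.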

Given a concept class $C$ over $\{0,1\}^n$ and a Boolean function $f \in C$, let size($f$) 
denote the description length of $f$ under some reasonable encoding scheme. We 
say that a learning algorithm $A$ for $C$ in the mistake-bound model is 
attribute-efficient if the mistake bound of $A$ on any concept $f \in C$ is polynomial in 
size($f$). In particular, the description length of a length $k$ decision list (parity) is 
$O(k \log n)$, and thus we would ideally like to have poly($n$)-time algorithms which 
learn decision lists (parities) of length $k$ with a mistake bound of poly($k$, $\log n$). 

(We note here that attribute efficiency has also been studied in other learning 
models, namely Valiant's Probably Approximately Correct (PAC) model of 
learning from random examples. Standard conversion techniques are known [1, 
14,22] which can be used to transform any mistake bound algorithm into a PAC 
learning algorithm. These transformations essentially preserve the running time 
of the mistake bound algorithm, and the sample size required by the PAC 
algorithm is essentially the mistake bound. Thus, positive results for mistake bound 
learning, such as those we give for decision lists in this paper, directly yield 
corresponding positive results for the PAC model.) 

Finally, our results for decision lists are achieved by a careful analysis of 
polynomial threshold functions. Let $f$ be a Boolean function $f : \{0,1\}^n \to \{-1, 1\}$ 
and let $p$ be a polynomial in $n$ variables with integer coefficients. Let $d$ denote 
the degree of $p$ and let $W$ denote the sum of the absolute values of $p$'s integer 
coefficients. If the sign of $p(x)$ equals $f(x)$ for every $x \in \{0,1\}^n$, then we say 
that $p$ is a polynomial threshold function (PTF) of degree $d$ and weight $W$ for $f$. 

3 Expanded-Winnow: Learning Polynomial Threshold 
Functions 

Littlestone [21] introduced the online Winnow algorithm and showed that it can 
attribute efficiently learn Boolean conjunctions, disjunctions, and low weight 
linear threshold functions. Throughout its execution Winnow maintains a linear 
threshold function as its hypothesis; at the heart of the algorithm is an update 
rule which makes a multiplicative update to each coefficient of the hypothesis 
each time a mistake is made. Since its introduction Winnow has been intensively 
studied from both applied and theoretical standpoints (see e.g. [6,12,16,30]). 

The following theorem (which, as noted in [33], is implicit in Littlestone’s 
analysis in [21]) gives a mistake bound for Winnow for linear threshold functions: 

Theorem 4. Let $f(x)$ be the linear threshold function $\mathrm{sign}(\sum_{i=1}^{n} w_i x_i - \theta)$ over 
inputs $x \in \{0,1\}^n$ where $\theta$ and $w_1, \ldots, w_n$ are integers. Let $W = \sum_{i=1}^{n} |w_i|$. 
Then Winnow learns $f(x)$ with mistake bound $O(W^2 \log n)$, and uses $n$ time 
steps per example. 

We will use a generalization of the Winnow algorithm, which we call 
Expanded-Winnow, to learn polynomial threshold functions of degree at most $d$. 
Our generalization introduces $\sum_{i=1}^{d} \binom{n}{i}$ variables (one for each monomial of 
degree up to $d$) and runs Winnow to learn a linear threshold function over these 
new variables. More precisely, in each trial we convert the $n$-bit received example 
$x = (x_1, \ldots, x_n)$ into a $\sum_{i=1}^{d} \binom{n}{i}$ bit expanded example (where the bits in the 
expanded example correspond to monomials over $x_1, \ldots, x_n$), and we give the 
expanded example to Winnow. Thus the hypothesis which Winnow maintains, 
a linear threshold function over the space of expanded features, is a 
polynomial threshold function of degree $d$ over the original $n$ variables $x_1, \ldots, x_n$. 
Theorem 2, which follows directly from Theorem 4, summarizes the performance 
of Expanded-Winnow: 

Theorem 2. Let $C$ be a class of Boolean functions over $\{0,1\}^n$ with the property 
that each $f \in C$ has a polynomial threshold function of degree at most $d$ and 
weight at most $W$. Then the Expanded-Winnow algorithm runs in $n^d$ time per 
example and has mistake bound $O(W^2 \cdot d \cdot \log n)$ for $C$. 

Theorem 2 shows that the degree of a polynomial threshold function strongly 
affects Expanded- Winnow’s running time, and the weight of a polynomial thresh- 
old function strongly affects its sample complexity. 
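
The feature expansion behind Expanded-Winnow can be sketched as follows. This is a simplified illustration, not the authors' algorithm: `expand` and the `Winnow` class are hypothetical names, and the learner shown is a basic positive-weight Winnow with promotion and demotion by a factor `alpha`, an assumed variant that omits the standard tricks needed for negative coefficients.

```python
from itertools import combinations
from math import comb


def expand(x, d):
    """Map an n-bit example to one bit per monomial of degree 1..d over x."""
    n = len(x)
    return [int(all(x[i] for i in mono))
            for size in range(1, d + 1)
            for mono in combinations(range(n), size)]


class Winnow:
    """Basic multiplicative-update learner over 0/1 features (assumed variant)."""

    def __init__(self, num_features, alpha=2.0):
        self.w = [1.0] * num_features
        self.theta = float(num_features)
        self.alpha = alpha

    def predict(self, z):
        active = sum(w for w, zi in zip(self.w, z) if zi)
        return 1 if active >= self.theta else -1

    def update(self, z, label):
        """Multiplicative update on a mistake; returns True if a mistake occurred."""
        if self.predict(z) == label:
            return False
        factor = self.alpha if label == 1 else 1.0 / self.alpha
        self.w = [w * factor if zi else w for w, zi in zip(self.w, z)]
        return True


# Hypothetical usage: degree-2 expansion of a 3-bit example.
z = expand([1, 0, 1], 2)
learner = Winnow(len(z))
made_mistake = learner.update(z, 1)
```

Each example costs time proportional to the number of monomials, which is where the $n^d$ per-example running time in Theorem 2 comes from.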

4 Constructing PTFs for Decision Lists 



In previous constructions of polynomial threshold functions for computational 
learning theory applications [18,17,27] the sole goal has been to minimize the 
degree of the polynomials regardless of the size of the coefficients. As one example, 
the construction of [18] of $\tilde{O}(n^{1/3})$ degree PTFs for DNF formulae yields 
polynomials whose coefficients can be doubly exponential in the degree. In contrast, 
we must now construct PTFs that have low degree and low weight. 

We give two constructions of PTFs for decision lists, each of which has rela- 
tively low degree and relatively low weight. We then combine these to achieve 
an optimal construction with improved bounds on both degree and weight. 

4.1 Outer Construction 

Let $L$ be a decision list of length $k$ over variables $x_1, \ldots, x_k$. We first give a 
simple construction of a degree $h$, weight $2^{O(k/h+h)}$ PTF for $L$ which is based 
on breaking the list $L$ into sublists. We call this construction the "outer 
construction" since we will ultimately combine this construction with a different 
construction for the "inner" sublists. 

We begin by showing that $L$ can be expressed as a threshold of modified 
decision lists which we now define. The set $B_h$ of modified decision lists is defined 
as follows: each function in $B_h$ is a decision list $(\ell_1, b_1), (\ell_2, b_2), \ldots, (\ell_h, b_h), 0$ 
where each $\ell_i$ is some literal over $x_1, \ldots, x_n$ and each $b_i \in \{-1, 1\}$. Thus the 
only difference between a modified decision list $f \in B_h$ and a normal decision 
list of length $h$ is that the final output value is 0 rather than $b_{h+1} \in \{-1, +1\}$. 

Without loss of generality we may suppose that the list $L$ is 
$(x_1, b_1), \ldots, (x_k, b_k), b_{k+1}$. We break $L$ sequentially into $k/h$ blocks each of length 
$h$. Let $f_i \in B_h$ be the modified decision list which corresponds to the $i$-th block 
of $L$, i.e. $f_i$ is the list $(x_{(i-1)h+1}, b_{(i-1)h+1}), \ldots, (x_{ih}, b_{ih}), 0$. Intuitively 
$f_i$ computes the $i$-th block of $L$ and equals 0 only if we "fall off the edge" of the 
$i$-th block. We then have the following straightforward claim: 

Claim. The decision list $L$ is equivalent to 

$$\mathrm{sign}\Big(\sum_{i=1}^{k/h} 2^{k/h-i+1} f_i(x) + b_{k+1}\Big). \qquad (1)$$

Proof. Given an input $x \ne 0^k$ let $r = (i-1)h + c$ be the first index such that $x_r$ 
is satisfied. It is easy to see that $f_j(x) = 0$ for $j < i$ and hence the value in (1) 
is $2^{k/h-i+1} b_r + \sum_{j=i+1}^{k/h} 2^{k/h-j+1} f_j(x) + b_{k+1}$, the sign of which is easily seen 
to be $b_r$. Finally if $x = 0^k$ then the argument to (1) is $b_{k+1}$. □ 

Note: It is easily seen that we can replace the 2 in formula (1) by a 3; this will 
prove useful later. 
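
The claim can be checked by brute force for small parameters. The sketch below assumes that formula (1) weights the $i$-th block by $2^{k/h-i+1}$ and adds $b_{k+1}$, which is the form consistent with the proof and with the weight bound used later; the function names are hypothetical, and `base=3` corresponds to the variant mentioned in the note.

```python
from itertools import product


def first_match(bits, default, x):
    """Decision list (x_1, b_1), ..., (x_m, b_m), default over unnegated variables."""
    for xi, b in zip(x, bits):
        if xi == 1:
            return b
    return default


def threshold_form(bits, default, x, h, base=2):
    """Sign of the weighted sum of block values f_i plus the default bit."""
    num_blocks = len(bits) // h
    total = default
    for i in range(1, num_blocks + 1):
        lo, hi = (i - 1) * h, i * h
        f_i = first_match(bits[lo:hi], 0, x[lo:hi])  # modified list: 0 at the end
        total += base ** (num_blocks - i + 1) * f_i
    return 1 if total > 0 else -1


def claim_holds(k, h, base=2):
    """Check the threshold form against the list for every labelling and input."""
    for labels in product((-1, 1), repeat=k + 1):
        bits, default = labels[:k], labels[k]
        for x in product((0, 1), repeat=k):
            if first_match(bits, default, x) != threshold_form(bits, default, x, h, base):
                return False
    return True
```

The check works because the block containing the first satisfied variable carries a weight strictly larger than the combined magnitude of all later blocks plus the default bit.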

As an aside, note that Claim 4.1 can already be used to obtain a tradeoff 
between running time and sample complexity for learning decision lists. The 
class $B_h$ contains at most $(4n)^h$ functions. Thus as in Section 3 it is possible 
to run the Winnow algorithm using the functions in $B_h$ as the base features for 
Winnow. (So for each example $x$ which it receives, the algorithm would first 
compute the value of $f(x)$ for each $f \in B_h$, and would then use this vector 
of $(f(x))_{f \in B_h}$ values as the example point for Winnow.) A direct analogue of 
Theorem 2 now implies that Expanded- Winnow (run over this expanded feature 
space of functions from Bh) can be used to learn Lk in time with 

mistake bound $2^{O(k/h)} h \log n$. 

However, it will be more useful for us to obtain a PTF for L. We can do this 
from Claim 4.1 as follows: 

Theorem 5. Let $L$ be a decision list of length $k$. For any $h < k$ we have that $L$ 
is computed by a polynomial threshold function of degree $h$ and weight $4 \cdot 2^{k/h+h}$. 

Proof. Consider the first modified decision list 
$f_1 = (\ell_1, b_1), (\ell_2, b_2), \ldots, (\ell_h, b_h), 0$ in the expression (1). For $\ell$ a literal let $\tilde{\ell}$ 
denote $x$ if $\ell$ is an unnegated variable $x$ and let $\tilde{\ell}$ denote $1 - x$ if $\ell$ is a negated 
variable $\bar{x}$. We have that for all $x \in \{0,1\}^k$, $f_1(x)$ is computed exactly by the 
polynomial 

$$f_1(x) = \tilde{\ell}_1 b_1 + (1 - \tilde{\ell}_1)\tilde{\ell}_2 b_2 + (1 - \tilde{\ell}_1)(1 - \tilde{\ell}_2)\tilde{\ell}_3 b_3 + \cdots + (1 - \tilde{\ell}_1) \cdots (1 - \tilde{\ell}_{h-1})\tilde{\ell}_h b_h.$$

This polynomial has degree $h$ and has weight at most $2^{h+1}$. Summing these 
polynomial representations for $f_1, \ldots, f_{k/h}$ as in (1) we see that the resulting 
PTF given by (1) has degree $h$ and weight at most $2^{k/h+1} \cdot 2^{h+1} = 4 \cdot 2^{k/h+h}$. □ 
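
The polynomial for a single block can be expanded symbolically to confirm its degree and weight. The sketch below handles only unnegated literals and represents a multilinear polynomial as a dictionary from sets of variable indices to integer coefficients; this representation and all names are assumptions made for the illustration.

```python
from itertools import product


def poly_mul(p, q):
    """Multiply multilinear polynomials; x*x = x is valid on 0/1 inputs."""
    out = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            m = m1 | m2
            out[m] = out.get(m, 0) + c1 * c2
    return {m: c for m, c in out.items() if c != 0}


def poly_add(p, q):
    out = dict(p)
    for m, c in q.items():
        out[m] = out.get(m, 0) + c
    return {m: c for m, c in out.items() if c != 0}


def block_polynomial(bits):
    """b_1 x_1 + (1 - x_1) x_2 b_2 + ... + (1 - x_1)...(1 - x_{h-1}) x_h b_h."""
    total = {}
    prefix = {frozenset(): 1}  # running product of (1 - x_j) for earlier j
    for i, b in enumerate(bits):
        total = poly_add(total, poly_mul(prefix, {frozenset([i]): b}))
        prefix = poly_mul(prefix, {frozenset(): 1, frozenset([i]): -1})
    return total


def evaluate(p, x):
    return sum(c for m, c in p.items() if all(x[i] for i in m))


def weight(p):
    return sum(abs(c) for c in p.values())


def degree(p):
    return max(len(m) for m in p)
```

For a block of length $h$ with unnegated literals the expansion has exactly $2^h - 1$ monomials with coefficients $\pm 1$, which sits within the $2^{h+1}$ bound used in the proof.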

Specializing to the case $h = \sqrt{k}$ we obtain: 

Corollary 1. Let $L$ be a decision list of length $k$. Then $L$ is computed by a 
polynomial threshold function of degree $k^{1/2}$ and weight $4 \cdot 2^{2k^{1/2}}$. 

We close this section by observing that an intermediate result of [18] can be 
used to give an alternate proof of Corollary 1 with slightly weaker parameters; 
however our later proofs require the construction given in this section. 

4.2 Inner Approximator 

In this section we construct low degree, low weight polynomials which 
approximate (in the $L_\infty$ norm) the modified decision lists from the previous subsection. 
Moreover, the polynomials we construct are exactly correct on inputs which "fall 
off the end": 

Theorem 6. Let $f \in B_h$ be a modified decision list of length $h$ (without loss 
of generality we may assume that $f$ is $(x_1, b_1), \ldots, (x_h, b_h), 0$). Then there is a 
degree $2\sqrt{h} \log h$ polynomial $p$ such that 

- for every input $x \in \{0,1\}^h$ we have $|p(x) - f(x)| \le 1/h$. 

- $p(0^h) = f(0^h) = 0$. 

Proof. As in the proof of Theorem 5 we have that 

$$f(x) = b_1 x_1 + b_2 (1 - x_1) x_2 + \cdots + b_h (1 - x_1) \cdots (1 - x_{h-1}) x_h.$$

We will construct a lower (roughly $\sqrt{h}$) degree polynomial which closely 
approximates $f$. Let $T_i$ denote $(1 - x_1) \cdots (1 - x_{i-1}) x_i$, so we can rewrite $f$ as 

$$f(x) = b_1 T_1 + b_2 T_2 + \cdots + b_h T_h.$$

We approximate each $T_i$ separately as follows: set $A_i(x) = h - i + x_i + \sum_{j=1}^{i-1} (1 - x_j)$. 
Note that for $x \in \{0,1\}^h$, we have $T_i(x) = 1$ iff $A_i(x) = h$ and 
$T_i(x) = 0$ iff $0 \le A_i(x) \le h - 1$. Now define the polynomial 

$$Q_i(x) = q(A_i(x)/h) \quad \text{where} \quad q(y) = C_d(y(1 + 1/h)).$$

As in [18], here $C_d(x)$ is the $d$th Chebyshev polynomial of the first kind (a univariate polynomial of degree $d$) with $d$ set to $\lceil \sqrt{h} \rceil$. We will need the following facts about Chebyshev polynomials [9]:

- $|C_d(x)| \le 1$ for $|x| \le 1$ with $C_d(1) = 1$;

- $C_d'(x) \ge d^2$ for $x > 1$ with $C_d'(1) = d^2$.

- The coefficients of $C_d$ are integers each of whose magnitude is at most $2^d$.

These first two facts imply that $q(1) \ge 2$ but $|q(y)| \le 1$ for $y \in [0, 1 - \frac{1}{h}]$. We thus have that $Q_i(x) = q(1) \ge 2$ if $T_i(x) = 1$ and $|Q_i(x)| \le 1$ if $T_i(x) = 0$.

Now define $P_i(x) = \left( Q_i(x)/q(1) \right)^{2\log h}$. This polynomial is easily seen to be a good approximator for $T_i$: if $x \in \{0,1\}^h$ is such that $T_i(x) = 1$ then $P_i(x) = 1$, and if $x \in \{0,1\}^h$ is such that $T_i(x) = 0$ then $|P_i(x)| \le (1/2)^{2\log h} = 1/h^2$.

Now define $R(x) = \sum_{i=1}^{h} b_i P_i(x)$ and $p(x) = R(x) - R(0^h)$. It is clear that $p(0^h) = 0$. We will show that for every input $0^h \ne x \in \{0,1\}^h$ we have $|p(x) - f(x)| \le 1/h$. Fix some such $x$; let $i$ be the first index such that $x_i = 1$. As shown above we have $P_i(x) = 1$. Moreover, by inspection of $T_j(x)$ we have that $T_j(x) = 0$ for all $j \ne i$, and hence $|P_j(x)| \le 1/h^2$. Consequently the value of $R(x)$ must lie in $[b_i - \frac{1}{h}, b_i + \frac{1}{h}]$. Since $f(x) = b_i$ we have that $p(x)$ is an $L_\infty$ approximator for $f(x)$ as desired.

Finally, it is straightforward to verify that p{x) has the claimed degree. □ 
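The construction can be exercised numerically for small $h$. This is a hedged sketch rather than the authors' implementation: it evaluates $C_d$ by the standard three-term recurrence, and it assumes the exponent applied to $Q_i(x)/q(1)$ is $\lceil 2\log_2 h \rceil$ (the exponent is partly illegible in the source). The function names are illustrative.

```python
import math
from itertools import product

def chebyshev(d, t):
    """C_d(t) via C_0 = 1, C_1 = t, C_{m+1} = 2 t C_m - C_{m-1}."""
    prev, cur = 1.0, t
    if d == 0:
        return prev
    for _ in range(d - 1):
        prev, cur = cur, 2.0 * t * cur - prev
    return cur

def build_inner_approximator(b):
    """Return p(x) = R(x) - R(0^h) for the list (x_1,b_1),...,(x_h,b_h),0 (h >= 2)."""
    h = len(b)
    d = math.isqrt(h - 1) + 1             # ceil(sqrt(h))
    power = math.ceil(2 * math.log2(h))   # assumed exponent 2 log h
    scale = 1.0 + 1.0 / h
    q1 = chebyshev(d, scale)              # q(1)

    def P(i, x):  # i is 1-based, as in the text
        a = h - i + x[i - 1] + sum(1 - x[j] for j in range(i - 1))
        return (chebyshev(d, (a / h) * scale) / q1) ** power

    def R(x):
        return sum(b[i - 1] * P(i, x) for i in range(1, h + 1))

    r0 = R((0,) * h)
    return lambda x: R(x) - r0

def decision_list_value(b, x):
    """f(x): b_i for the first i with x_i = 1, else 0."""
    for bit, out in zip(x, b):
        if bit == 1:
            return out
    return 0

b = [1, -1, 1, 1, -1, -1, 1, -1]          # hypothetical outputs, h = 8
p = build_inner_approximator(b)
max_err = max(abs(p(x) - decision_list_value(b, x))
              for x in product((0, 1), repeat=len(b)))
```

For $h = 8$ the sketch uses $d = 3$ and exponent 6, and `max_err` stays below $1/h$; the value at the all-zero input is exactly 0 because the same quantity is subtracted from itself.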

Strictly speaking we cannot discuss the weight of the polynomial p since its 
coefficients are rational numbers but not integers. However, by multiplying p 
by a suitable integer (clearing denominators) we obtain an integer polynomial 
with essentially the same properties. Using the third fact about Chebyshev poly- 
nomials from our proof above, we have that q{l) is a rational number N 1 /N 2 
where Ni,N 2 are each integers of magnitude Each Qi{x) for i = 1, . . . ,h 

can be written as an integer polynomial (of weight divided by 

Thus each Pi{x) can be written as Pj(x)/(/i'^Ni)^^°s^ where Pi (x) is an inte- 
ger polynomial of weight follows that p{x) equals p(x)/C, where 

C is an integer which is at most 2*^^^ ' and p is a polynomial with integer 

coefficients and weight . We thus have 

Corollary 2. Let $f \in B_h$ be a modified decision list of length $h$. Then there is an integer polynomial $p(x)$ of degree $2\sqrt{h}\log h$ and weight $2^{O(\sqrt{h}\log^2 h)}$ and an integer $C = 2^{O(\sqrt{h}\log^2 h)}$ such that

- for every input $x \in \{0,1\}^h$ we have $|p(x) - Cf(x)| \le C/h$.

- $p(0^h) = f(0^h) = 0$.

The fact that $p(0^h)$ is exactly 0 will be important in the next subsection when we combine the inner approximator with the outer construction.

4.3 Composing the Constructions 

In this section we combine the two constructions from the previous subsections 
to obtain our main polynomial threshold construction: 
Theorem 7. Let $L$ be a decision list of length $k$. Then for any $h < k$, $L$ is computed by a polynomial threshold function of degree $O(h^{1/2}\log h)$ and weight $2^{O(k/h + h^{1/2}\log^2 h)}$.

Proof. We suppose without loss of generality that $L$ is the decision list $(x_1, b_1), \ldots, (x_k, b_k), b_{k+1}$. We begin with the outer construction: from the note following Claim 4.1 we have that

$$L(x) = \mathrm{sign}\left( C \left[ \sum_{i=1}^{k/h} 3^{k/h-i+1} f_i(x) + b_{k+1} \right] \right)$$

where $C$ is the value from Corollary 2 and each $f_i$ is a modified decision list of length $h$ computing the restriction of $L$ to its $i$th block as defined in Subsection 4.1. Now we use the inner approximator to replace each $Cf_i$ above by $p_i$, the approximating polynomial from Corollary 2, i.e. consider $\mathrm{sign}(H(x))$ where

$$H(x) = \sum_{i=1}^{k/h} 3^{k/h-i+1} p_i(x) + C b_{k+1}.$$

We will show that $\mathrm{sign}(H(x))$ is a PTF which computes $L$ correctly and has the desired degree and weight.

Fix any $x \in \{0,1\}^k$. If $x = 0^k$ then by Corollary 2 each $p_i(x)$ is 0 so $H(x) = Cb_{k+1}$ has the right sign. Now suppose that $r = (i-1)h + c$ is the first index such that $x_r = 1$. By Corollary 2, we have that

- $3^{k/h-j+1} p_j(x) = 0$ for $j < i$;

- $3^{k/h-i+1} p_i(x)$ differs from $3^{k/h-i+1} C b_r$ by at most $C 3^{k/h-i+1} \cdot \frac{1}{h}$;

- The magnitude of each value $3^{k/h-j+1} p_j(x)$ is at most $C 3^{k/h-j+1}(1 + \frac{1}{h})$ for $j > i$.

Combining these bounds, the value of $H(x)$ differs from $3^{k/h-i+1} C b_r$ by at most

$$C \left( \frac{3^{k/h-i+1}}{h} + \left(1 + \frac{1}{h}\right)\left[ 3^{k/h-i} + 3^{k/h-i-1} + \cdots + 3 \right] + 1 \right),$$

which is easily seen to be less than $C 3^{k/h-i+1}$ in magnitude. Thus the sign of $H(x)$ equals $b_r$, and consequently $\mathrm{sign}(H(x))$ is a valid polynomial threshold representation for $L(x)$. Finally, our degree and weight bounds from Corollary 2 imply that the degree of $H(x)$ is $O(h^{1/2}\log h)$ and the weight of $H(x)$ is $2^{O(k/h) + O(h^{1/2}\log^2 h)}$, and the theorem is proved. □

Taking $h = k^{2/3}/\log^{4/3} k$ in the above theorem we obtain our main result on representing decision lists as polynomial threshold functions:

Theorem 3. Let $L$ be a decision list of length $k$. Then $L$ is computed by a polynomial threshold function of degree $k^{1/3}\log^{1/3} k$ and weight $2^{O(k^{1/3}\log^{4/3} k)}$.


Theorem 3 immediately implies that Expanded-Winnow can learn decision lists of length $k$ using $2^{\tilde{O}(k^{1/3})} \log n$ examples and time $n^{\tilde{O}(k^{1/3})}$.
4.4 Application to Learning Decision Trees

In 1989 Ehrenfeucht and Haussler [11] gave an a time algorithm for 

learning decision trees of size s over n variables. Their algorithm uses 
examples, and they asked if the sample complexity could be reduced to poly(n, s). 
We can apply our techniques here to give an algorithm using 2^*^® ^ ^ log n ex- 
amples, if we are willing to spend time: 

Theorem 8. Let D he a decision tree of size s over n variables. Then D can he 
learned with mistake bound 2^^^ ^ logn in time \ 

The proof is omitted because of space limitations in these proceedings. 

5 Lower Bounds for Decision Lists 

Here we observe that our construction from Theorem 7 is essentially optimal in 
terms of the tradeoff it achieves between polynomial threshold function degree 
and weight. 

In [3], Beigel constructs an oracle separating PP from $\mathrm{P}^{\mathrm{NP}}$. At the heart of his construction is a proof that any low degree PTF for a particular decision list, called the ODDMAXBIT$_n$ function, must have large weights:

Definition 1. The ODDMAXBIT$_n$ function on input $x = x_1, \ldots, x_n \in \{0,1\}^n$ equals $(-1)^i$ where $i$ is the index of the first nonzero bit in $x$.
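As a small illustration of Definition 1, the function can be written directly. This is only a sketch; the definition does not fix the value on the all-zero input, so the code assumes the default $(-1)^{n+1}$ used in the decision-list form of the function.

```python
def odd_max_bit(x):
    """ODDMAXBIT_n on a 0/1 sequence x = (x_1, ..., x_n), indices 1-based."""
    n = len(x)
    for i, bit in enumerate(x, start=1):
        if bit == 1:
            return (-1) ** i          # first nonzero bit decides the sign
    return (-1) ** (n + 1)            # assumed default on the all-zero input

values = [odd_max_bit(x) for x in [(1, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0)]]
```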

It is clear that the ODDMAXBIT$_n$ function is equivalent to a decision list $(x_1, -1), (x_2, 1), (x_3, -1), \ldots, (x_n, (-1)^n), (-1)^{n+1}$ of length $n$. The main technical theorem which Beigel proves in [3] states that any polynomial threshold function of degree $d$ computing ODDMAXBIT$_n$ must have weight $2^{\Omega(n/d^2)}$:

Theorem 9. Let $p$ be a degree $d$ PTF with integer coefficients which computes ODDMAXBIT$_n$. Then $w = 2^{\Omega(n/d^2)}$ where $w$ is the weight of $p$.

(As stated in [3] the bound is actually $w \ge \frac{1}{s} 2^{\Omega(n/d^2)}$ where $s$ is the number of nonzero coefficients in $p$. Since $s \le w$ this implies the result as stated above.)

A lower bound of $2^{\Omega(n)}$ on the weight of any linear threshold function ($d = 1$) for ODDMAXBIT$_n$ has long been known [25]; Beigel’s proof generalizes this lower bound to all $d = O(n^{1/3})$. A matching upper bound of $2^{O(n)}$ on weight for $d = 1$ has also long been known [25]. Our Theorem 7 gives an upper bound which matches Beigel’s lower bound (up to logarithmic factors) for all $d = O(n^{1/3})$:

Observation 10. For any $d = O(n^{1/3})$ there is a polynomial threshold function of degree $d$ and weight $2^{\tilde{O}(n/d^2)}$ which computes ODDMAXBIT$_n$.



Proof. Set $d = h^{1/2}\log h$ in Theorem 7. The weight bound given by Theorem 7 is $2^{O(n\log^2 d/d^2 + d\log d)}$, which is $2^{\tilde{O}(n/d^2)}$ for $d = O(n^{1/3})$. □
Note that since the ODDMAXBIT$_n$ function has a polynomial size DNF,
Beigel’s lower bound gives a polynomial size DNF / such that any degree 
polynomial threshold function for / must have weight This suggests 

that the Expanded-Winnow algorithm cannot learn polynomial size DNF in 
20 (n from 2" examples for any e > 0, and thus suggests that im- 

proving the sample complexity of the DNF learning algorithm from [18] while 
maintaining its 2*^^" ^ running time may be difficult. 

6 Learning Parity Functions 

6.1 A Polynomial Time Algorithm 

Recall that the standard algorithm for learning parity functions works by viewing a set of $m$ labelled examples as a set of $m$ linear equations over GF(2). Gaussian elimination is used to solve the system and thus find a consistent parity. Even though there exists a solution of weight at most $k$ (since the target parity is of size $k$), Gaussian elimination applied to a system of $m$ equations in $n$ variables over GF(2) may yield a solution of weight as large as $\min(m, n)$. Thus this standard algorithm and analysis give an $O(n)$ sample complexity bound for learning a parity of length at most $k$.
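The standard algorithm recalled above can be sketched as follows. This is an illustrative implementation, not taken from the paper: each equation is packed into a Python integer (an assumption made for compactness), free variables are set to 0, and the names `solve_gf2` and `learn_parity` are hypothetical.

```python
from itertools import product

def solve_gf2(rows, n):
    """Solve a linear system over GF(2) by Gauss-Jordan elimination.

    Each row is an int: bits 0..n-1 are coefficients, bit n is the right-hand
    side. Returns one solution as an n-bit int, or None if inconsistent.
    """
    rows = list(rows)
    pivots = []          # (column, row index) of each pivot
    r = 0                # number of pivot rows found so far
    for col in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i] >> col & 1), None)
        if pivot is None:
            continue     # free column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i] >> col & 1:
                rows[i] ^= rows[r]
        pivots.append((col, r))
        r += 1
    # Remaining rows have all-zero coefficients; a 1 on the right is a contradiction.
    if any(rows[i] >> n & 1 for i in range(r, len(rows))):
        return None
    solution = 0
    for col, i in pivots:
        if rows[i] >> n & 1:
            solution |= 1 << col
    return solution

def learn_parity(examples, n):
    """Return the index set of a parity consistent with all (x, y) examples."""
    rows = []
    for x, y in examples:
        row = y << n
        for j, bit in enumerate(x):
            row |= bit << j
        rows.append(row)
    solution = solve_gf2(rows, n)
    if solution is None:
        return None
    return {j for j in range(n) if solution >> j & 1}

# Hypothetical target: parity of x_1 and x_3 over 5 variables (0-based indices).
examples = [(x, (x[1] + x[3]) % 2) for x in product((0, 1), repeat=5)]
hypothesis = learn_parity(examples, 5)
```

With fewer examples the system is underdetermined, and the returned parity may involve many variables, which is the source of the $O(n)$ sample bound mentioned above.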

We now describe a simple poly($n$)-time algorithm for PAC learning an unknown size-$k$ parity using $O(n^{1-1/k}\log n)$ examples. As far as we know this is the first improvement on the standard algorithm and analysis described above.

Theorem 11. The class of all parity functions on at most $k$ variables is PAC learnable in $O(n^4)$ time using $O(n^{1-1/k}\log n)$ examples. The hypothesis output by the learning algorithm is a parity function on $O(n^{1-1/k})$ variables.

Proof. If $k = \Omega(\log n)$ then the standard algorithm suffices to prove the claimed bound. We thus assume that $k = o(\log n)$.

Let $H$ be the set of all parity functions of size at most $n^{1-1/k}$. Note that $|H| \le n^{n^{1-1/k}}$ so $\log|H| \le n^{1-1/k}\log n$. Consider the following algorithm:

1. Choose $m = \frac{1}{\epsilon}(\log|H| + \log(1/\delta))$ examples. Express each example as a linear equation over $n$ variables mod 2 as described above.

2. Randomly choose a set of $n - n^{1-1/k}$ variables and assign them the value 0.

3. Use Gaussian elimination to attempt to solve the resulting system of equations on the remaining $n^{1-1/k}$ variables. If the system has a solution, output the corresponding parity (of size at most $n^{1-1/k}$) as the hypothesis. If the system has no solution, output “FAIL.”

If the simplified system of equations has a solution, then by a standard Occam’s Razor argument this solution is a good hypothesis. We will show that the simplified system has a solution with probability $\Omega(1/n)$. The theorem follows by repeating steps 2 and 3 of the above algorithm until a solution is found (an expected $O(n)$ repetitions will suffice).
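Steps 2 and 3 and the repetition loop can be sketched as below. This is a hedged illustration, not the authors' code: the number of kept variables `ell` is passed in explicitly, the retry cap `max_tries` is an added safeguard, and the bit-packed GF(2) solver and all names are assumptions of the sketch.

```python
import random
from itertools import product

def solve_gf2(rows, n):
    """Gauss-Jordan over GF(2); rows are ints with the right-hand side in bit n."""
    rows = list(rows)
    pivots, r = [], 0
    for col in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i] >> col & 1), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i] >> col & 1:
                rows[i] ^= rows[r]
        pivots.append((col, r))
        r += 1
    if any(rows[i] >> n & 1 for i in range(r, len(rows))):
        return None
    return sum(1 << col for col, i in pivots if rows[i] >> n & 1)

def learn_sparse_parity(examples, n, ell, rng, max_tries=10000):
    """Repeat: keep ell random variables, zero the rest, try to solve."""
    for _ in range(max_tries):
        kept = rng.sample(range(n), ell)      # the other n - ell variables are set to 0
        rows = []
        for x, y in examples:
            row = y << ell
            for pos, var in enumerate(kept):
                row |= x[var] << pos
            rows.append(row)
        solution = solve_gf2(rows, ell)
        if solution is not None:              # otherwise this round outputs "FAIL"
            return {kept[pos] for pos in range(ell) if solution >> pos & 1}
    return None

# Hypothetical target: parity of variables 2 and 5 among n = 8, keeping ell = 4.
examples = [(x, (x[2] + x[5]) % 2) for x in product((0, 1), repeat=8)]
hypothesis = learn_sparse_parity(examples, 8, 4, random.Random(0))
```

The returned parity always uses at most `ell` variables, which is what keeps the hypothesis class, and hence the Occam sample bound, small.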
Let $V$ be the set of $k$ relevant variables on which the unknown parity function depends. It is easy to see that as long as no variable in $V$ is assigned a 0, the resulting simplified system of equations will have a solution. Let $\ell = n^{1-1/k}$. The probability that in Step 2 the $n - \ell$ variables chosen do not include any variables in $V$ is exactly $\binom{n-k}{n-\ell} / \binom{n}{n-\ell}$, which equals $\binom{n-k}{\ell-k} / \binom{n}{\ell}$. Expanding binomial coefficients we have

$$\frac{\binom{n-k}{\ell-k}}{\binom{n}{\ell}} = \prod_{i=1}^{k} \frac{\ell-k+i}{n-k+i} > \left(\frac{\ell-k}{n-k}\right)^{k}.$$

For $\ell = n^{1-1/k}$ and $k = o(\log n)$ this product is $\Omega(1/n)$, which proves the theorem. □
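The success probability of one round can be checked with exact arithmetic. In this sketch $\ell$ denotes the number of variables kept in Step 2 (so $n - \ell$ are set to 0); the function names and the sample values are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def success_probability(n, ell, k):
    """Probability that a random set of n - ell zeroed variables misses all k relevant ones."""
    return Fraction(comb(n - k, n - ell), comb(n, n - ell))

def product_form(n, ell, k):
    """The same probability written as prod_{i=1}^{k} (ell - k + i) / (n - k + i)."""
    result = Fraction(1)
    for i in range(1, k + 1):
        result *= Fraction(ell - k + i, n - k + i)
    return result

prob = success_probability(8, 4, 2)
```

For $n = 8$, $\ell = 4$, $k = 2$ both forms give $3/14$.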

6.2 An $\tilde{O}(n^{k/2})$ Time Attribute Efficient Algorithm

Dan Spielman [31] has observed that it is possible to improve on the $n^k$ time bound of a naive search algorithm for learning parity using $k \log n$ examples:

Theorem 12 (Spielman). The class of all size-$k$ parity functions is PAC learnable in $\tilde{O}(n^{k/2})$ time from $O(k\log n)$ examples, using size-$k$ parities as the hypothesis class.

Proof. By Occam’s Razor we need only show that given a set of $m = O(k\log n)$ labelled examples, a consistent size-$k$ parity can be found in $\tilde{O}(n^{k/2})$ time.

Given a labelled example $(x_1, \ldots, x_n; y)$ we will view $y$ as an $(n+1)$st attribute $x_{n+1}$. Thus our task is to find a set of $(k+1)$ attributes $x_{i_1}, \ldots, x_{i_{k+1}}$, one of which must be $x_{n+1}$, which sum to 0 in every example in the sample.

Let $(x^1; y^1), \ldots, (x^m; y^m)$ be the labelled examples in our sample. Given a subset $S$ of variables, let $v_S$ denote the length-$m$ binary vector $(\chi_S(x^1), \ldots, \chi_S(x^m))$ obtained by computing the parity function $\chi_S$ on each example in our sample.

We construct two lists, each containing $\binom{n}{k/2}$ vectors of length $m$. The first list contains all the vectors $v_S$ where $S$ ranges over all $k/2$-element subsets of $\{x_1, \ldots, x_n\}$. The second list contains all the vectors $v_{S \cup \{x_{n+1}\}}$ where $S$ again ranges over all $k/2$-element subsets of $\{x_1, \ldots, x_n\}$.

After sorting these two lists of vectors, which takes $\tilde{O}(n^{k/2})$ time, we scan through them in parallel in time linear in the length of the lists and find a pair of vectors $v_{S_1}$ from the first list and $v_{S_2 \cup \{x_{n+1}\}}$ from the second list which are the same. (Note that any decomposition of the target parity into two subsets $S_1$ and $S_2$ of $k/2$ variables each will give such a pair). The set $S_1 \cup S_2$ is then a consistent parity of size $k$. □
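The two-list idea in the proof can be sketched as follows. Note the substitutions: this sketch uses a hash table in place of sorting and a parallel scan, assumes $k$ is even, and explicitly requires the two halves to be disjoint so that the union has size exactly $k$. The function name is hypothetical.

```python
from itertools import combinations, product

def find_parity_meet_in_middle(examples, n, k):
    """Find a size-k parity consistent with all (x, y) examples; k must be even."""
    half = k // 2

    def signature(subset, with_label):
        # Vector of parities of `subset` on every example; optionally XOR in the
        # label, which plays the role of the extra attribute x_{n+1}.
        sig = []
        for x, y in examples:
            bit = sum(x[j] for j in subset) % 2
            if with_label:
                bit ^= y
            sig.append(bit)
        return tuple(sig)

    first_list = {}
    for s1 in combinations(range(n), half):
        first_list.setdefault(signature(s1, False), []).append(s1)

    for s2 in combinations(range(n), half):
        for s1 in first_list.get(signature(s2, True), []):
            if not set(s1) & set(s2):
                return set(s1) | set(s2)
    return None

# Hypothetical target: parity of variables 0, 2, 3, 5 among n = 6.
examples = [(x, (x[0] + x[2] + x[3] + x[5]) % 2) for x in product((0, 1), repeat=6)]
found = find_parity_meet_in_middle(examples, 6, 4)
```

Matching signatures mean $\chi_{S_1}(x) = \chi_{S_2}(x) \oplus y$ on every example, so for disjoint halves the parity on $S_1 \cup S_2$ agrees with every label.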

7 Future Work 

An obvious goal for future work is to improve our algorithmic results for learn- 
ing decision lists. As a first step, one might attempt to extend the tradeoffs 
we achieve: is it possible to learn decision lists of length k in n time from
poly(k, log n) examples?

Another goal is to extend our results for decision lists to broader concept 
classes. In particular, it would be interesting to obtain analogues of our algorith- 
mic results for learning general linear threshold functions (independent of their 
weight). We note here that Goldmann et al. [13] have given a linear threshold 
function over {—1, 1}” for which any polynomial threshold function must have 
weight regardless of its degree. Moreover Krause and Pudlak [19] have 

shown that any Boolean function which has a polynomial threshold function over 
{0, 1}" of weight w has a polynomial threshold function over {—1, 1}" of weight 
n^w'^. These results imply that representational results akin to Theorem 3 for 
general linear threshold functions must be quantitatively weaker than Theorem 
3; in particular, there is a linear threshold function over {0, 1}” with k nonzero 
coefficients for which any polynomial threshold function, regardless of degree, 
must have weight 

For parity functions many questions remain as well: can we learn parity functions on $k = \Theta(\log n)$ variables in polynomial time using a sublinear number of examples? Can we learn size-$k$ parities in polynomial time using fewer than $n^{1-1/k}$ examples? Can we learn size-$k$ parities from $O(k\log n)$ examples in time $\tilde{O}(n^{k/3})$? Progress on any of these fronts would be quite interesting.
Abstract. We consider the problem of learning on a compact metric space $X$ in a functional analytic framework. For a dense subalgebra of $Lip(X)$, the space of all Lipschitz functions on $X$, the Representer Theorem is derived. We obtain exact solutions in the case of least square minimization and regularization and suggest an approximate solution for the Lipschitz classifier.



1 Introduction 

One important direction of current machine learning research is the general- 
ization of the Support Vector Machine paradigm to handle the case where the 
input space is an arbitrary metric space. One such generalization method was 
suggested recently in [2], [5]: we embed the input space $X$ into a Banach space $E$ and the hypothesis space of decision functions on $X$ into the dual space $E^*$ of linear functionals on $E$. In [5], the hypothesis space is $Lip(X)$, the space of all bounded Lipschitz functions on $X$. The input space $X$ itself is embedded into the space $AE(X_0)$ of molecules on $X_0$, which up to isometry, is the largest Banach space that $X$ embeds into isometrically [6].

The Representer Theorem, which is essential in the formulation of the solu- 
tions of Support Vector Machines, was, however, not achieved in [2]. In order to 
obtain this theorem, it is necessary to restrict ourselves to subspaces of Lip{X) 
consisting of functions of a given explicit form. In this paper, we introduce a 
general method for deriving the Representer Theorem and apply it to a dense 
subalgebra of Lip{X). We then use the theorem to solve a problem of least 
square minimization and regularization on the subalgebra under consideration. 
Our approach can be considered as a generalization of the Lagrange polynomial 
interpolation formulation. It is substantially different from that in [5], which 
gives solutions that are minimal Lipschitz extensions (section 6.1). 

Throughout the paper, $(X, d)$ will denote a compact metric space and $S = \{(x_i, y_i)\}_{i=1}^n \in (X \times \mathbb{R})^n$ a sample of length $n$.
1.1 The Representer Theorem

The Representer Theorem is not magic, neither is it an exclusive property of Support Vector Machines and Reproducing Kernel Hilbert Spaces. It is a direct consequence of the fact that our training data is finite. A general method to derive the Representer Theorem is as follows. Let $\mathcal{F}$ be a normed space of real-valued functions on $X$. Consider the evaluation operator

$$A_S : \mathcal{F} \to \mathbb{R}^n \tag{1}$$

defined by

$$A_S(f) = (f(x_1), \ldots, f(x_n)) \tag{2}$$

Consider the problem of minimizing the following functional over $\mathcal{F}$:

$$I_S(f) = \sum_{i=1}^{n} V(f(x_i), y_i) \tag{3}$$

with $V$ being a convex, lower semicontinuous loss function. Let $\ker(A_S)$ denote the kernel of the map $A_S$, defined by

$$\ker(A_S) = \{f \in \mathcal{F} : A_S(f) = (f(x_1), \ldots, f(x_n)) = (0, \ldots, 0)\} \tag{4}$$

Clearly, the problem of minimizing $I_S$ over $\mathcal{F}$ is equivalent to minimizing $I_S$ over the quotient space $\mathcal{F}/\ker(A_S)$, which being isomorphic to the image $\mathrm{Im}(A_S) \subset \mathbb{R}^n$, is finite dimensional. Let $\mathcal{F}_n$ be the complementary subspace of $\ker(A_S)$

$$\mathcal{F} = \mathcal{F}_n \oplus \ker(A_S) \tag{5}$$

that is a linear subspace of $\mathcal{F}$ such that $\mathcal{F}_n \cap \ker(A_S) = \{0\}$ and every $f \in \mathcal{F}$ admits a unique decomposition

$$f = f_n + r \tag{6}$$

where $f_n \in \mathcal{F}_n$ and $r \in \ker(A_S)$. Clearly we have $f - f_n \in \ker(A_S)$. Consider the equivalence relation defining the quotient space $\mathcal{F}/\ker(A_S)$:

$$f \sim f_0 \iff f \in [f_0] \iff A_S f = A_S f_0 \iff f - f_0 \in \ker(A_S) \tag{7}$$

Thus $f \sim f_0$ iff they have the same projection onto $\mathcal{F}_n$. Hence $\mathcal{F}/\ker(A_S) \cong \mathcal{F}_n$ via the identification

$$[f] \mapsto f_n \tag{8}$$

We are led to the following fundamental result: 

Theorem 1. There is always a minimizer of $I_S$, if one exists, lying in a finite dimensional subspace $\mathcal{F}_n$ of $\mathcal{F}$, with dimension at most $n$. The space $\mathcal{F}_n$ is the complementary subspace of $\ker(A_S)$.

Proof. From the preceding discussion, it clearly follows that the problem of minimizing $I_S$ over $\mathcal{F}$ is equivalent to minimizing $I_S$ over the subspace $\mathcal{F}_n$. This subspace has dimension at most $n$:

$$\dim(\mathcal{F}_n) = \dim(\mathcal{F}/\ker(A_S)) \le n$$

Thus if $I_S$ has minimizers in $\mathcal{F}$, then it must have one minimizer lying in $\mathcal{F}_n$. □

Corollary 1. Suppose the problem of minimizing $I_S$ over $\mathcal{F}_n$ has a set of solutions $F^*$, then the set of all minimizers of $I_S$ over $\mathcal{F}$ has the form

$$F^* + \ker(A_S) = \{f^* + r \mid f^* \in F^*, r \in \ker(A_S)\} \tag{9}$$

Proof. This is obvious. □

Consider now the problem of minimizing the regularized functional

$$I_{S,\gamma}(f) = \sum_{i=1}^{n} V(f(x_i), y_i) + \gamma\,\Omega(f) \tag{10}$$

where $\Omega$ is a strictly convex, coercive functional on $\mathcal{F}$. We have another key result:

Theorem 2. The functional $I_{S,\gamma}$ has a unique minimizer in $\mathcal{F}$. Assume further that the regularizer $\Omega$ satisfies:

$$\Omega(f) = \Omega(f_n + r) \ge \Omega(f_n) \tag{11}$$

for all $f \in \mathcal{F}$, where $f_n \in \mathcal{F}_n$ and $r \in \ker(A_S)$. Then this minimizer lies in the finite dimensional subspace $\mathcal{F}_n$.

Proof. The existence and uniqueness of the minimizer $f^*$ is guaranteed by the coercivity and strict convexity of the regularizer $\Omega$, respectively. If furthermore, $\Omega(f_n + r) \ge \Omega(f_n)$ then for all $f \in \mathcal{F}$:

$$I_{S,\gamma}(f) \ge I_{S,\gamma}(f_n)$$

Thus a function $f^*$ minimizing $I_{S,\gamma}$ must lie in the finite dimensional subspace $\mathcal{F}_n$ of $\mathcal{F}$. □

Without the assumption of strict convexity and coercivity of the functional $\Omega$, we can no longer state the uniqueness or existence of the minimizer, but we still have the following result:

Theorem 3. Suppose the functional $\Omega$ satisfies

$$\Omega(f) = \Omega(f_n + r) \ge \Omega(f_n) \tag{12}$$

for all $f \in \mathcal{F}$, where $f_n \in \mathcal{F}_n$ and $r \in \ker(A_S)$, with equality iff $r = 0$. If the problem of minimizing $I_{S,\gamma}$ over $\mathcal{F}$ has a solution $f^*$, it must lie in the finite dimensional subspace $\mathcal{F}_n$.
Proof. This is similar to the above theorem. □

Having the above key results, the Representer Theorem can then be obtained if we can exhibit a basis for the above finite dimensional subspace $\mathcal{F}_n$ via the data points $x_i$ ($1 \le i \le n$).

Example 1 (RKHS). Let $\mathcal{F} = \mathcal{H}_K$ be the reproducing kernel Hilbert space induced by a Mercer kernel $K$, then from the reproducing property $f(x) = \langle f, K(x, .) \rangle$, it follows that

$$\ker(A_S) = \mathrm{span}\{K(x_i, .)\}_{i=1}^{n\ \perp}$$

From the unique orthogonal decomposition

$$\mathcal{H}_K = \mathrm{span}\{K(x_i, .)\}_{i=1}^{n} \oplus \mathrm{span}\{K(x_i, .)\}_{i=1}^{n\ \perp}$$

it follows that $\mathcal{F}_n = \mathrm{span}\{K(x_i, .)\}_{i=1}^{n}$. □

In section 2, we apply the above framework to derive the Representer Theorem for the special case where $\mathcal{F}$ is the vector space of all algebraic polynomials on a compact subset of the real line $\mathbb{R}$. We then generalize this result to the case of a general compact metric space in sections 3 and 4.

2 Learning Over Compact Subsets of $\mathbb{R}$

Let $X \subset \mathbb{R}$ be compact. Let $P(X)$ be the vector space of all algebraic polynomials on $X$, then $P(X)$ is dense in $C(X)$ according to the Weierstrass Approximation Theorem:

Theorem 4 (Weierstrass Approximation Theorem). Each continuous function $f \in C(X)$ is uniformly approximable by algebraic polynomials: for each $\epsilon > 0$, there is a polynomial $p \in P(X)$ such that

$$|f(x) - p(x)| < \epsilon \tag{13}$$

for all $x \in X$.

Consider the problem of minimizing the functional $I_S$ over $P(X)$.

Lemma 1.

$$\ker(A_S) = \{f \in P(X) : f(x) = (x - x_1) \cdots (x - x_n) r_n(x)\} \tag{14}$$

for some $r_n \in P(X)$. Let $P_n(X) = \mathrm{span}\{1, (x - x_1), (x - x_1)(x - x_2), \ldots, (x - x_1) \cdots (x - x_{n-1})\}$, then $P(X)$ admits the following unique decomposition

$$P(X) = P_n(X) \oplus \ker(A_S) \tag{15}$$

Proof. First we note that $f \in \ker(A_S)$ iff $(f(x_1), \ldots, f(x_n)) = (0, \ldots, 0)$ iff each $x_i$ ($1 \le i \le n$) is a zero of $f$ iff $f$ contains the linear factor $(x - x_i)$ ($1 \le i \le n$), hence the form of $\ker(A_S)$.

To prove the unique decomposition, we apply the Taylor expansion to $f$, with centers $x_1, \ldots, x_n$ successively:

$$\begin{aligned}
f(x) &= c_0 + (x - x_1) r_1(x) = c_0 + (x - x_1)[c_1 + (x - x_2) r_2(x)] \\
&= c_0 + c_1(x - x_1) + (x - x_1)(x - x_2) r_2(x) = \cdots \\
&= c_0 + c_1(x - x_1) + c_2(x - x_1)(x - x_2) + \cdots + c_{n-1}(x - x_1) \cdots (x - x_{n-1}) \\
&\quad + (x - x_1) \cdots (x - x_n) r_n(x)
\end{aligned}$$

with $c_i \in \mathbb{R}$ ($0 \le i \le n - 1$). □
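The successive expansion in the proof amounts to repeated division by the linear factors $(x - x_i)$. The sketch below carries this out with synthetic (Horner) division; the coefficient-list representation (lowest degree first) and the function names are assumptions made for illustration.

```python
def divide_by_linear(coeffs, a):
    """Divide a polynomial (coefficients from low to high degree) by (x - a).

    Returns (quotient coefficients, remainder), using Horner's scheme.
    """
    if not coeffs:
        return [], 0
    acc = 0
    partial = []
    for c in reversed(coeffs):        # from the leading coefficient down
        acc = acc * a + c
        partial.append(acc)
    remainder = partial.pop()         # the last Horner value is f(a)
    return list(reversed(partial)), remainder

def newton_decomposition(coeffs, nodes):
    """Return ([c_0, ..., c_{n-1}], r_n) with
    f = c_0 + c_1 (x - x_1) + ... + c_{n-1} (x - x_1)...(x - x_{n-1})
        + (x - x_1)...(x - x_n) r_n(x)."""
    cs = []
    rest = list(coeffs)
    for a in nodes:
        rest, c = divide_by_linear(rest, a)
        cs.append(c)
    return cs, rest

# f(x) = x^3 with sample points x_1 = 1, x_2 = 2
cs, r_n = newton_decomposition([0, 0, 0, 1], [1, 2])
```

Here $x^3 = 1 + 7(x - 1) + (x - 1)(x - 2)(x + 3)$, so the part in $P_n(X)$ is $1 + 7(x - 1)$ and the part in $\ker(A_S)$ is $(x - 1)(x - 2)(x + 3)$.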

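The basis $\{\prod_{j=1}^{i} (x - x_j)\}_{i=0}^{n-1}$ for $P_n(X)$ is not symmetric in the $x_i$’s. Let us construct a symmetric basis for this subspace.

Lemma 2.

$$P_n(X) = \mathrm{span}\Big\{\prod_{j \ne i} (x - x_j)\Big\}_{i=1}^{n} \tag{16}$$

Proof. Let $f^* = \sum_{i=0}^{n-1} c_i \prod_{j=1}^{i} (x - x_j)$. Define the function

$$g^* = \sum_{i=1}^{n} d_i \prod_{j \ne i} (x - x_j) \tag{17}$$

with

$$d_i = \frac{\sum_{j=0}^{i-1} c_j \prod_{l=1}^{j} (x_i - x_l)}{\prod_{j \ne i} (x_i - x_j)} \tag{18}$$

It is straightforward to verify that $f^*(x_i) = g^*(x_i)$ ($1 \le i \le n$). Since $f^*$ and $g^*$ have degree $n - 1$, it follows that $f^* = g^*$. □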



We arrive at the following Representer Theorem for the space $P(X)$:

Theorem 5 (Representer Theorem). The problem of minimizing the functional $I_S$ over the space $P(X)$ is equivalent to minimizing $I_S$ over the finite-dimensional subspace $P_n(X) = \mathrm{span}\{\prod_{j \ne i} (x - x_j)\}_{i=1}^{n}$. Suppose the latter problem has a set of solutions $F^*$, then the set of all minimizers of $I_S(f)$ over $P(X)$ has the form:

$$F^* + \ker(A_S) = \{f^* + (x - x_1) \cdots (x - x_n) r_n \mid f^* \in F^*, r_n \in P(X)\} \tag{19}$$

Each $f^* \in F^*$ admits a unique representation:

$$f^* = \sum_{i=1}^{n} c_i^* \prod_{j \ne i} (x - x_j) \tag{20}$$

for $c_i^* \in \mathbb{R}$ ($1 \le i \le n$).

Proof. This is a special case of theorem 1, with $\mathcal{F}_n = P_n(X)$. □
3 The Stone-Weierstrass Theorem

Let us now consider the general case where X is a compact metric space. We 
then have Stone’s generalization of Weierstrass Approximation Theorem. For a 
very accessible treatment of this topic, we refer to [1]. 

Definition 1 (Algebra). A real algebra is a vector space $A$ over $\mathbb{R}$ together with a binary operation representing multiplication: $x, y \in A \Rightarrow xy \in A$ satisfying:

(i) Bilinearity: for all $a, b \in \mathbb{R}$ and all $x, y, z \in A$:

$(a.x + b.y)z = a.xz + b.yz$
$x(a.y + b.z) = a.xy + b.xz$

(ii) Associativity: $x(yz) = (xy)z$

The multiplicative identity, if it exists, is called the unit of the algebra. An algebra with unit is called a unital algebra. A complex algebra over $\mathbb{C}$ is defined similarly.

Definition 2 (Normed algebra-Banach algebra). A normed algebra is a pair $(A, \|\ \|)$ consisting of an algebra $A$ together with a norm $\|\ \| : A \to [0, \infty)$ satisfying

$$\|xy\| \le \|x\| \|y\| \tag{21}$$

A Banach algebra is a normed algebra that is a Banach space relative to its given norm.

Example 2. $C(X)$: Let $X$ be a compact Hausdorff space. We have the unital algebra $C(X)$ of all real-valued continuous functions on $X$, with multiplication and addition being defined pointwise:

$(fg)(x) = f(x)g(x)$ and $(f + g)(x) = f(x) + g(x)$

Relative to the supremum norm $\|\ \|_\infty$, $C(X)$ is a commutative Banach algebra with unit.

Definition 3 (Separation). Let $X$ be a metric space. Let $A$ be a set of real-valued functions on $X$. $A$ is said to separate the points of $X$ if for each pair $x, y$ of distinct points of $X$ there exists a function $f \in A$ such that $f(x) \ne f(y)$.

Theorem 6 (Stone-Weierstrass Theorem). Let $X$ be a compact metric space and $A$ a subalgebra of $C(X)$ that contains the constant functions and separates the points of $X$. Then $A$ is dense in the Banach space $C(X)$.
4 Learning Over Compact Metric Spaces

Let {X, d) be a compact metric space containing at least two points. 

Proposition 1. Let $A$ be the subalgebra of $C(X)$ generated by the family

$$\{1, \phi_x : t \mapsto d(x, t)\}_{x \in X} \tag{22}$$

where 1 denotes the constant function with value 1, then $A$ is dense in $C(X)$.

Proof. By the Stone-Weierstrass Theorem, we need to verify that $A$ separates the points of $X$. Let $t_1, t_2$ be two distinct points in $X$, so that $d(t_1, t_2) \ne 0$. Suppose that $d(x, t_1) = d(x, t_2)$ for all $x \in X$. Let $x = t_1$, we then obtain:

$$d(t_1, t_2) = d(t_1, t_1) = 0$$

a contradiction. Thus there must exist $x \in X$ such that $d(x, t_1) \ne d(x, t_2)$, showing that $A$ separates the points in $X$. □

Consider the algebra A defined in the above proposition and the problem of 
minimizing Is over A. 

Lemma 3. Each $f \in A$ can be expressed in the form:

$$f = g + d(x_1, .) \cdots d(x_n, .) f_{n+1} \tag{23}$$

where

$$g = f_1 + d(x_1, .) f_2 + d(x_1, .) d(x_2, .) f_3 + \cdots + d(x_1, .) d(x_2, .) \cdots d(x_{n-1}, .) f_n \tag{24}$$

and $f_{n+1} \in A$, $f_i \in A/(d(x_i, .))$ with $(d(x_i, .))$ being the ideal generated by $d(x_i, .)$, $1 \le i \le n$.

Proof. This is similar to a Taylor expansion: clearly there is $f_1 \in A/(d(x_1, .))$ such that

$$\begin{aligned}
f &= f_1 + d(x_1, .) r_1 = f_1 + d(x_1, .)[f_2 + d(x_2, .) r_2] \\
&= f_1 + d(x_1, .) f_2 + d(x_1, .) d(x_2, .) r_2
\end{aligned}$$

Continuing in this way we obtain the lemma. □

Since $f(x_i) = g(x_i)$ ($1 \le i \le n$), minimizing $I_S$ over $A$ is equivalent to minimizing $I_S$ over all $f$ of the form:

$$f = f_1 + d(x_1, .) f_2 + d(x_1, .) d(x_2, .) f_3 + \cdots + d(x_1, .) d(x_2, .) \cdots d(x_{n-1}, .) f_n \tag{25}$$

with $f_i \in A/(d(x_i, .))$. From the above equation, we obtain for $1 \le i \le n$:

$$f(x_i) = \sum_{j=1}^{i} f_j(x_i) \prod_{k=1}^{j-1} d(x_k, x_i) \tag{26}$$
It is straightforward to verify that



k—1 i—k 



IV^=k,j^^ d{xj,Xi) 



( 27 ) 



From the above general expression for /, it follows that there are constants 
Ci G K (1 < t < n) such that 



n 

i=l j^i 

Let $P_n(X)$ denote the $n$-dimensional subspace of $A$ defined by

$$P_n(X) = \mathrm{span}\Big\{\prod_{j \ne i} d(x_j, .)\Big\}_{i=1}^{n} \tag{29}$$

We have proved the following theorem:

Theorem 7 (Representer Theorem). The problem of minimizing the functional $I_S$ over $A$ is equivalent to minimizing $I_S$ over the $n$-dimensional subspace $P_n(X) = \mathrm{span}\{\prod_{j \ne i} d(x_j, .)\}_{i=1}^{n}$. Suppose the latter problem has a set of solutions $F^*$, then the set of minimizers of $I_S$ over $A$ has the form

$$\{f^* + d(x_1, .) \cdots d(x_n, .) f_{n+1} : f^* \in F^*, f_{n+1} \in A\} \tag{30}$$

Each $f^* \in F^*$ admits a unique representation

$$f^* = \sum_{i=1}^{n} c_i^* \prod_{j \ne i} d(x_j, .) \tag{31}$$

for $c_i^* \in \mathbb{R}$ ($1 \le i \le n$). Let $\Omega$ be as in theorem 2, then the problem of minimizing the functional $I_{S,\gamma}$ over $A$ has a unique solution lying in $P_n(X)$.

Proof. This is a special case of theorems 1 and 2, with $\mathcal{F}_n = P_n(X)$. □

We now show that the algebra A consists of Lipschitz functions and that it is 
dense in the space Lip{X) of all Lipschitz functions on X, in the supremum 
norm: 

Lemma 4. For each $x \in X$, the function $\phi_x : t \mapsto d(x, t)$ is Lipschitz with Lipschitz constant $L(\phi_x) = 1$.

Proof. Let $t_1, t_2 \in X$. From the triangle inequality, we have:

$$d(x, t_1) \le d(x, t_2) + d(t_2, t_1) \Rightarrow d(x, t_1) - d(x, t_2) \le d(t_1, t_2)$$

Similarly, we have $d(x, t_2) - d(x, t_1) \le d(t_1, t_2)$. It follows that

$$|\phi_x(t_1) - \phi_x(t_2)| = |d(x, t_1) - d(x, t_2)| \le d(t_1, t_2)$$

with equality if $t_1 = x$ or $t_2 = x$. Thus $\phi_x$ is a Lipschitz function with Lipschitz constant $L(\phi_x) = 1$. □

Proposition 2. Let $X$ be a compact metric space and $A$ defined as above. Then $A$ consists of Lipschitz functions and $A$ is dense in $Lip(X)$ in the supremum norm.

Proof. Since Lipschitz functions are closed under addition, scalar multiplication, and for $X$ bounded, pointwise multiplication (see appendix), it follows from the above lemma that $A$ consists of Lipschitz functions, that is, $A$ is a subalgebra of $Lip(X)$. Since for compact $X$, both $A$ and $Lip(X)$ are dense in $C(X)$ in the supremum norm, it follows that $A$ is dense in $Lip(X)$ in the supremum norm. □

5 Least Square Minimization and Regularization 

5.1 Least Square Minimization 

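Let $S = \{(x_i, y_i)\}_{i=1}^{n} \in (X \times \mathbb{R})^n$ be a training sample of length $n$. Consider the problem of minimizing the empirical square error over $A$:

$$\frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 \tag{32}$$

or equivalently

$$I_S(f) = \sum_{i=1}^{n} (f(x_i) - y_i)^2 \tag{33}$$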



By the Representer Theorem, this is equivalent to minimizing the functional $I_S(f)$ over the finite dimensional subspace $P_n(X)$. Let $f = \sum_{i=1}^{n} c_i \prod_{j \ne i} d(x_j, .) \in Lip(X)$. Let

$$M_i = \prod_{j \ne i} d(x_j, x_i)$$

then clearly

$$f(x_i) = c_i M_i$$

Theorem 8. The problem of minimizing the functional $I_S(f)$ over the finite dimensional subspace $P_n(X)$ has a unique solution

$$f^* = \sum_{i=1}^{n} \frac{y_i}{M_i} \prod_{j \ne i} d(x_j, .) = \sum_{i=1}^{n} y_i \prod_{j \ne i} \frac{d(x_j, .)}{d(x_j, x_i)} \tag{34}$$

Proof. Each $f \in P_n(X)$ has the form $f = \sum_{i=1}^{n} c_i \prod_{j \ne i} d(x_j, .)$. Thus

$$f(x_i) = c_i \prod_{j \ne i} d(x_j, x_i) = c_i M_i$$

Clearly the smallest value that $I_S(f)$ assumes is zero, which occurs iff

$$f(x_i) = y_i \iff c_i M_i = y_i \iff c_i = \frac{y_i}{M_i}$$

This gives us the desired minimizer $f^*$. □

Remark 1. Let $\Phi_i(x) = \prod_{j \ne i} \frac{d(x_j, x)}{d(x_j, x_i)}$, then we have

$$\Phi_i(x_j) = \delta_{ij} \quad \text{and} \quad f^*(x) = \sum_{i=1}^{n} y_i \Phi_i(x)$$

In the case $X \subset \mathbb{R}$, these functions are precisely the Lagrange interpolation polynomials and we recover the Lagrange interpolation formula.
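The interpolant of Remark 1 is straightforward to evaluate for any metric given as a function. The following sketch assumes distinct sample points and takes the metric `d` as a parameter; the function name and the planar example are illustrative only.

```python
import math

def metric_interpolant(points, values, d):
    """Return f*(x) = sum_i y_i * prod_{j != i} d(x_j, x) / d(x_j, x_i).

    `points` must be pairwise distinct so that no denominator vanishes.
    """
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(points, values)):
            term = yi
            for j, xj in enumerate(points):
                if j != i:
                    term *= d(xj, x) / d(xj, xi)
            total += term
        return total
    return f

# Hypothetical sample in the plane with the Euclidean metric.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
values = [1.0, -1.0, 3.0]
f_star = metric_interpolant(points, values, math.dist)
fitted = [f_star(p) for p in points]
```

At a sample point $x_i$ every term other than the $i$th contains the factor $d(x_i, x_i) = 0$, so the fitted values reproduce the $y_i$.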

5.2 Least Square Regularization 

The minimization process above always gives an exact interpolation, which may lead to the undesirable phenomenon of overfitting. Hence we consider the following regularization problem. Each function $f \in A$ has the form $f = \sum_{J \subset I} c_J \prod_{j \in J} d(x_j, .)$ where $I$ is a finite index set. Consider the functional $\Omega : A \to \mathbb{R}$ defined by

$$\Omega(f) = \sum_{J \subset I} c_J^2 \tag{35}$$

Lemma 5. Let $f \in A$ with the decomposition: $f = g + d(x_1, .) \cdots d(x_n, .) f_{n+1}$ where $g \in P_n(X)$ and $f_{n+1} \in A$. Then $\Omega(f) = \Omega(g) + \Omega(f_{n+1})$.

Proof. This is obvious. □

Lemma 6. The functional $\Omega$ is strictly convex.

Proof. This follows from the strict convexity of the square function. □
Lemma 7. Let f = X),/c/ Hjcj d(xj, .) G A. Then 

ll/lloo < X! \cj\diam{xy-^^ < (36) 

,/c/ ./c/ ./c/ 

The functional 17 is coercive in the supremum norm: 

lim !?(/) = oo (37) 

ll/IU-s-oo 

Proof. We have 

\\Iljejd{xj,.)\\oo < dzam(X)l-^l 
It follows that



< E,/c/ \cj\diam{X)\J\ < 

Thus ll/lloo — >■ oo implies that X)jc/ 
coercive in the supremum norm. 



JC, 

oo as well, showing that ^2 is 

□ 



Lemma 8. Let f = d{x\, .) . . . d{xk, ■)• Then 

L{f) < kdiam{X)^-^ (38) 

f = X)jc/ '"4 rijGj d{xj, .). Then there is a constant C > 0 such that 

Hf) <CY,\cj\< CiY, (39) 

,/C/ JCI JCI 

In particular, for f = Y^=i O Tlj^i d{xj , .), we have 

n n 

i(/)<C'^|Q|<Cv^(^|ci|2)i/2 (40) 

i=l i=l 

Proof. The first inequality follows from a standard induction argument. This 
and the Cauchy-Schwarz inequality imply the other inequalities. □ 

Consider the problem of minimizing the regularized functional:

$$I_{S,\gamma}(f) = \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \gamma \, \Omega(f) \tag{41}$$

with regularization parameter $\gamma > 0$. By lemmas 7 and 8, this regularization
process aims to minimize $\sum_{i=1}^{n} (f(x_i) - y_i)^2$ and to penalize $\|f\|_\infty$ and $L(f)$
simultaneously.

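Theorem 9. The problem of minimizing the regularized functional $I_{S,\gamma}(f)$ over
the algebra $A$ has a unique solution $f^*$ which lies in the finite dimensional subspace
$P_n(X)$:

$$f^* = \sum_{i=1}^{n} \frac{y_i M_i}{\gamma + M_i^2} \prod_{j \neq i} d(x_j, \cdot). \tag{42}$$

Proof. The functional $\Omega$ is strictly convex and coercive in the supremum norm
on $A$ and satisfies $\Omega(f) = \Omega(g) + \Omega(f_{n+1}) \ge \Omega(g)$. Thus by the Representer
Theorem, there is a unique solution minimizing $I_{S,\gamma}(f)$, which lies in the finite
dimensional subspace $P_n(X)$. We have for $f \in P_n(X)$:

$$I_{S,\gamma}(f) = \sum_{i=1}^{n} \big[(c_i M_i - y_i)^2 + \gamma c_i^2\big] = \sum_{i=1}^{n} \big[c_i^2 (\gamma + M_i^2) - 2 c_i M_i y_i + y_i^2\big].$$

Differentiating and setting $\frac{\partial I_{S,\gamma}}{\partial c_i} = 2 c_i (\gamma + M_i^2) - 2 M_i y_i = 0$, we obtain

$$c_i = \frac{M_i y_i}{\gamma + M_i^2},$$

as claimed. □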
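The closed form (42) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes points on the real line with $d(x, y) = |x - y|$ and takes $M_i = \prod_{j \neq i} d(x_j, x_i)$, the value at $x_i$ of the $i$-th product function. The function names `basis_weights`, `regularized_coeffs` and `objective` are hypothetical.

```python
import math

def basis_weights(xs, d):
    """M_i = prod_{j != i} d(x_j, x_i): value at x_i of the i-th product function."""
    return [math.prod(d(xj, xi) for j, xj in enumerate(xs) if j != i)
            for i, xi in enumerate(xs)]

def regularized_coeffs(xs, ys, gamma, d):
    """Coefficients c_i = y_i M_i / (gamma + M_i^2), as in formula (42)."""
    M = basis_weights(xs, d)
    return [y * m / (gamma + m * m) for y, m in zip(ys, M)]

def objective(cs, xs, ys, gamma, d):
    """sum_i (c_i M_i - y_i)^2 + gamma * sum_i c_i^2, the functional on P_n(X)."""
    M = basis_weights(xs, d)
    return sum((c * m - y) ** 2 + gamma * c * c for c, m, y in zip(cs, M, ys))

d = lambda a, b: abs(a - b)                  # assumed metric on the real line
xs, ys, gamma = [0.0, 1.0, 3.0], [1.0, -1.0, 1.0], 0.5
cs = regularized_coeffs(xs, ys, gamma, d)
```

Because the objective decouples into one strictly convex quadratic per coefficient, perturbing any single $c_i$ away from the closed form should increase its value.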
6 The Lipschitz Classifier

Let $(x_i, y_i)_{i=1}^{n} \in (X \times \{\pm 1\})^n$ be a set of training data, with the assumption
that both classes $\pm 1$ are present. Let $X_0 = X \cup \{e\}$ where $e$ is a distinguished
base point with the metric $d_{X_0}|_{X \times X} = d$ and $d_{X_0}(x, e) = \mathrm{diam}(X)$ for $x \in X$.
It is straightforward to show that

Proposition 3 ([5]). $Lip(X)$ is isometrically isomorphic to $Lip_0(X_0)$ via the
map $\psi : Lip(X) \to Lip_0(X_0)$ defined by $(\psi f)(x) = f(x)$ for $x \in X$ and
$(\psi f)(e) = 0$. One has $\|f\|_L = L(\psi f)$.



Proposition 4 ([6]). $X$ embeds isometrically into the Banach space $AE(X_0)$,
via the map $\Phi(x) = m_x = m_{xe} = \chi_x - \chi_e$. $Lip_0(X_0)$ embeds isometrically into
the dual space $AE(X_0)^*$, via the map $T : Lip_0(X_0) \to AE(X_0)^*$ defined by
$\langle Tf, m \rangle = \sum_{x \in X_0} f(x) m(x)$ for all $f \in Lip_0(X_0)$, all $m \in AE(X_0)$. Clearly

$f(x) = \langle Tf, m_x \rangle$ for all $x \in X$.

The problem of finding a decision function $f \in Lip(X)$ separating the points
$x_i$ in $X$ is then equivalent to that of finding the corresponding linear functional
$Tf \in AE(X_0)^*$ separating the corresponding molecules $m_{x_i}$, that is a hyperplane
$H_f$ defined by $H_f = \{m \in AE(X_0) : \langle Tf, m \rangle = 0\}$. It is straightforward to show
the following

Proposition 5 (Margin of the Lipschitz Classifier [5]). Assume that
the hyperplane $H_f$ is normalized such that $\min_{1 \le i \le n} |f(x_i)| = 1$ and suppose that
$y_i f(x_i) \ge 1$ $(1 \le i \le n)$. Then

$$\rho = \inf_{1 \le i \le n, \; m_h \in H_f} \|m_{x_i} - m_h\|_{AE} \ge \frac{1}{L(f)}. \tag{43}$$

Thus the following algorithm corresponds to a large margin algorithm in
the space $AE(X_0)$:

Algorithm 1 ([5])

$$\text{Minimize}_{f \in Lip(X)} \; L(f) \quad \text{subject to} \quad y_i f(x_i) \ge 1 \;\; (1 \le i \le n) \tag{44}$$

The solutions of this algorithm are precisely the minimal Lipschitz extensions of
the function $f : \{x_i\}_{i=1}^{n} \to \{\pm 1\}$ with $f(x_i) = y_i$, as we show below.



6.1 Minimal Lipschitz Extensions 

The following was shown simultaneously in 1934 by McShane [4] and Whitney [7].

Proposition 6 (Minimal Lipschitz Extension-MLE). Let $(X, d)$ denote an
arbitrary metric space and let $E$ be any nonempty subset of $X$. Let $f : E \to \mathbb{R}$ be
a Lipschitz function. Then there exists a minimal Lipschitz extension of $f$ to
$X$, that is a Lipschitz function $h : X \to \mathbb{R}$ such that $h|_E = f$ and $L(h) = L(f)$.

Proof. Two such minimal Lipschitz extensions were constructed explicitly in [4]
and [7]:

$$\overline{f}(x) = \inf_{y \in E} \{ f(y) + L(f) \, d(x, y) \} \tag{45}$$

$$\underline{f}(x) = \sup_{y \in E} \{ f(y) - L(f) \, d(x, y) \} \tag{46}$$

Furthermore, if $u$ is any minimal Lipschitz extension of $f$ to $X$, then for all
$x \in X$:

$$\underline{f}(x) \le u(x) \le \overline{f}(x). \tag{47}$$

We refer to the above references for details. □

Let us return to the classification problem. Let $E = \{x_i\}_{i=1}^{n}$ and $f : E \to \{\pm 1\}$
be defined by $f(x_i) = y_i$. Let $X^+$ and $X^-$ denote the sets of training
points with positive and negative labels, respectively. Let
$d(X^+, X^-) = \inf_{x \in X^+, \, x' \in X^-} d(x, x')$. It is straightforward to see that $f$ is Lipschitz with
Lipschitz constant $L^* = 2 / d(X^+, X^-)$. The above proposition gives two of $f$'s minimal
Lipschitz extensions:

$\overline{f}(x) = \min_i \{ y_i + L^* d(x, x_i) \}$ and $\underline{f}(x) = \max_i \{ y_i - L^* d(x, x_i) \}$.

These are precisely the solutions of the above algorithm in [5].
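The two extensions can be evaluated directly from the training data. The sketch below is illustrative only: it assumes real-valued inputs with $d(x, y) = |x - y|$, and the names `lipschitz_constant`, `upper_extension` and `lower_extension` are hypothetical.

```python
def lipschitz_constant(points, labels, d):
    """L* = 2 / d(X+, X-), the Lipschitz constant of the +/-1 labelling."""
    gap = min(d(p, q)
              for p, a in zip(points, labels)
              for q, b in zip(points, labels)
              if a > 0 > b)
    return 2.0 / gap

def upper_extension(x, points, labels, L, d):
    """min_i { y_i + L d(x, x_i) }: the largest minimal Lipschitz extension."""
    return min(y + L * d(x, p) for p, y in zip(points, labels))

def lower_extension(x, points, labels, L, d):
    """max_i { y_i - L d(x, x_i) }: the smallest minimal Lipschitz extension."""
    return max(y - L * d(x, p) for p, y in zip(points, labels))

d = lambda a, b: abs(a - b)                  # assumed metric on the real line
points, labels = [0.0, 1.0, 4.0, 6.0], [1, 1, -1, -1]
L_star = lipschitz_constant(points, labels, d)   # closest opposite pair is 1.0 and 4.0
```

Both functions reproduce the labels on the training points, and the lower extension never exceeds the upper one, in line with (47). A classifier would take the sign of either function, or of any function between them.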

Remark 2. The notion of minimal Lipschitz extension is not completely satis- 
factory. Firstly, it is not unique. Secondly, and more importantly, it involves 
only the global Lipschitz constant and ignores what may happen locally. For a 
discussion of this phenomenon, we refer to [3]. 



6.2 A Variant of the Lipschitz Classifier 

The problem of computing the Lipschitz constants for a class of functions is 
nontrivial in general. It is easier to obtain an upper bound for L{f) and minimize 
it instead. Let us consider this approach with the algebra A, which is dense in 
Lip{X) in the supremum norm as shown above. 

From the above upper bound on $L(f)$, instead of minimizing $L(f)$, we can
minimize $\sum_{J \subset I} |c_J|$. We obtain the following algorithm:

Algorithm 2

$$\text{Minimize} \; \sum_{J \subset I} |c_J| \quad \text{subject to} \quad y_i f(x_i) \ge 1 \;\; (1 \le i \le n) \tag{48}$$

The functional $\Omega : A \to \mathbb{R}$ defined by

$$\Omega(f) = \sum_{J \subset I} |c_J| \tag{49}$$

clearly satisfies $\Omega(g + d(x_1, \cdot) \cdots d(x_n, \cdot) f_{n+1}) \ge \Omega(g)$ for all $g \in P_n(X)$ and
$f_{n+1} \in A$, with equality iff $f_{n+1} = 0$. Thus by theorem 3, we have the equivalent
problem:

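Algorithm 3

$$\text{Minimize} \; \sum_{i=1}^{n} |c_i| \quad \text{subject to} \quad y_i c_i M_i \ge 1 \;\; (1 \le i \le n) \tag{50}$$

According to lemma 7, the functional $\Omega$ is coercive in the $\|\cdot\|_\infty$ norm, thus the
problem has a solution. Let us show that it is unique and find its explicit form.

Theorem 10. The above minimization problem has a unique solution

$$f^*(x) = \sum_{i=1}^{n} y_i \prod_{j \neq i} \frac{d(x_j, x)}{d(x_j, x_i)}. \tag{51}$$

Proof. $\sum_{i=1}^{n} |c_i|$ is obviously minimum when $y_i c_i M_i = 1$, implying that

$$c_i = \frac{1}{y_i M_i} = \frac{y_i}{M_i},$$

as we claimed. □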
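Formula (51) is straightforward to evaluate for any metric. The sketch below is an illustration under assumptions rather than the authors' code: it uses the Euclidean metric on the plane, and the name `lagrange_type_classifier` is hypothetical.

```python
import math

def lagrange_type_classifier(xs, ys, d):
    """Return f*(x) = sum_i y_i prod_{j != i} d(x_j, x) / d(x_j, x_i), as in (51)."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            num = math.prod(d(xj, x) for j, xj in enumerate(xs) if j != i)
            den = math.prod(d(xj, xi) for j, xj in enumerate(xs) if j != i)
            total += yi * num / den
        return total
    return f

# Assumed setting: three labelled points in the plane with the Euclidean metric.
d = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
ys = [1, -1, 1]
f_star = lagrange_type_classifier(xs, ys, d)
```

At a training point $x_i$ every term except the $i$-th contains the factor $d(x_i, x_i) = 0$, so the function interpolates the labels exactly and the constraints in (50) hold with equality.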



Remark 3. Clearly we have $f^*(x_i) = y_i$. From lemma 8, we have $L(f^*) \le C \sum_{i=1}^{n} |c_i|$.
Thus it follows that

$$\rho \ge \frac{1}{L(f^*)} \ge \frac{1}{C \sum_{i=1}^{n} |c_i|}.$$

Thus the above algorithm can also be viewed as a large margin algorithm.



7 Conclusion 

We presented a general method for deriving the Representer Theorem in learning 
algorithms. The method is applied to a dense subalgebra of the space of Lipschitz 
functions on a general compact metric space X . We then used the Representer 
Theorem to obtain solutions to several special minimization and regularization 
problems. This approach may be used to obtain solutions when minimizing other 
functionals over other function spaces as well. We plan to continue with a more 
systematic regularization method and comprehensive analysis of our approach 
in future research. 
A Lipschitz Functions and Lipschitz Spaces

We review some basic properties of Lipschitz functions and the corresponding
function spaces. For a detailed treatment we refer to [6]. Let $X$ be a metric space.
A function $f : X \to \mathbb{R}$ (or $\mathbb{C}$) is called Lipschitz if there is a constant $L$ such
that for all $x, y \in X$:

$$|f(x) - f(y)| \le L \, d(x, y). \tag{52}$$

The smallest such $L$ is called the Lipschitz constant of $f$, denoted by $L(f)$. We
have

$$L(f) = \sup_{x \neq y} \frac{|f(x) - f(y)|}{d(x, y)}. \tag{53}$$
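On a finite sample the supremum in (53) becomes a maximum over pairs. The following sketch assumes a uniform grid on $[0, 1]$ with $d(x, y) = |x - y|$; the name `lipschitz_constant_on_sample` is hypothetical, and the value it returns is only a lower bound for $L(f)$ on the whole space.

```python
from itertools import combinations

def lipschitz_constant_on_sample(f, points, d):
    """Largest slope |f(x) - f(y)| / d(x, y) over pairs of distinct sample points.

    This is L(f) restricted to the sample, hence a lower bound for L(f) on X.
    """
    return max(abs(f(x) - f(y)) / d(x, y) for x, y in combinations(points, 2))

d = lambda a, b: abs(a - b)                     # assumed metric on [0, 1]
grid = [i / 100 for i in range(101)]            # sample of X = [0, 1]
L_abs = lipschitz_constant_on_sample(lambda x: abs(x - 0.5), grid, d)
L_sq = lipschitz_constant_on_sample(lambda x: x * x, grid, d)
```

For $f(x) = |x - 0.5|$ the sample value is 1 up to rounding, and for $f(x) = x^2$ it is attained at the two largest grid points.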



Proposition 7 ([6]). Let $X$ be a metric space and $f, g, f_n$ $(n \in \mathbb{N})$ be Lipschitz
functions from $X$ into $\mathbb{R}$ (or $\mathbb{C}$). Then:

(a) $L(af) = |a| L(f)$ for all $a \in \mathbb{R}$.

(b) $L(f + g) \le L(f) + L(g)$.

Proposition 8 ([6]). Let $X$ be a metric space and $f, g : X \to \mathbb{R}$ (or $\mathbb{C}$) be bounded
Lipschitz functions. Then

(a) $L(fg) \le \|f\|_\infty L(g) + \|g\|_\infty L(f)$.

(b) If $\mathrm{diam}(X) < \infty$, then the product of any two scalar-valued Lipschitz
functions is again Lipschitz.



Definition 4 ([6]). Let $X$ be a metric space. $Lip(X)$ is the space of all bounded
Lipschitz functions on $X$ equipped with the Lipschitz norm:

$$\|f\|_L = \max\{\|f\|_\infty, L(f)\}.$$

If $X$ is a bounded metric space, that is $\mathrm{diam}(X) < \infty$, we follow [5] and define:

$$\|f\|_L = \max\Big\{\frac{\|f\|_\infty}{\mathrm{diam}(X)}, L(f)\Big\}.$$

Theorem 11 ([6]). $Lip(X)$ is a Banach space. If $X$ is compact, then $Lip(X)$
is dense in $C(X)$ in the supremum norm.



Definition 5. Let $X_0$ be a pointed metric space, with a distinguished base point
$e$. Then we define

$$Lip_0(X_0) = \{f \in Lip(X_0) : f(e) = 0\}. \tag{54}$$

On this space, $L(f)$ is a norm.
Definition 6 (Arens-Eells Space). Let $X$ be a metric space. A molecule of
$X$ is a function $m : X \to \mathbb{R}$ (or $\mathbb{C}$) that is supported on a finite set of $X$ and
that satisfies:

$$\sum_{x \in X} m(x) = 0.$$

For $x, y \in X$, define the molecule $m_{xy} = \chi_x - \chi_y$, where $\chi_x$ and $\chi_y$ denote
the characteristic functions of the singleton sets $\{x\}$ and $\{y\}$. On the set of
molecules, consider the norm:

$$\|m\|_{AE} = \inf\Big\{\sum_{i=1}^{n} |a_i| \, d(x_i, y_i) : m = \sum_{i=1}^{n} a_i m_{x_i y_i}\Big\}.$$

The Arens-Eells space $AE(X)$ is defined to be the completion of the space of
molecules under the above norm.
Abstract. Kernel-based methods are powerful for high dimensional
function representation. The theory of such methods rests upon their 
attractive mathematical properties whose setting is in Hilbert spaces 
of functions. It is natural to consider what the corresponding circum- 
stances would be in Banach spaces. Led by this question we provide 
theoretical justifications to enhance kernel-based methods with function 
composition. We explore regularization in Banach spaces and show how 
this function representation naturally arises in that problem. Further- 
more, we provide circumstances in which these representations are dense 
relative to the uniform norm and discuss how the parameters in such 
representations may be used to fit data. 



1 Introduction 

Kernel-based methods have in recent years been a focus of attention in Machine
Learning. They consist in choosing a kernel $K : D \times D \to \mathbb{R}$ which provides
functions of the form

$$\sum_{j \in \mathbb{Z}_m} c_j K(x_j, \cdot) \tag{1.1}$$

whose parameters $D_m = \{x_j : j \in \mathbb{Z}_m\} \subseteq D$ and $c = \{c_j : j \in \mathbb{Z}_m\} \subseteq \mathbb{R}$
are used to learn an unknown function $f$. Here, we use the notation
$\mathbb{Z}_m = \{0, \ldots, m - 1\}$. Typically $K$ is chosen to be a reproducing kernel of some Hilbert
space. Although this is not required, it does provide (1.1) with a Hilbert space
justification. The simplicity of the functional form (1.1) and its ability to address
efficiently high dimensional learning tasks make it very attractive. Since it arises
from Hilbert space considerations it is natural to inquire what may transpire in
other Banach spaces. The goal of this paper is to study this question, especially
learning algorithms based on regularization in a Banach space. A consequence
of our remarks here is that function composition should be introduced in the
representation (1.1). That is, we suggest the use of the nonlinear functional
form

$$\phi\Big(\sum_{j \in \mathbb{Z}_m} c_j g_j\Big) \tag{1.2}$$

where $\phi : \mathbb{R} \to \mathbb{R}$ and, for $j \in \mathbb{Z}_m$, $g_j : D \to \mathbb{R}$ are prescribed functions,
for example (but not necessarily so) $g_j = K(x_j, \cdot)$. In section 2 we provide an
abstract framework where in a particular case the functional form (1.2) naturally
arises. What we say here is a compromise between the generality in which we
work and our desire to provide useful functional forms for Machine Learning.

We consider the problem of learning a function $f$ in a Banach space from
a set of continuous linear functionals $L_j(f) = y_j$, $j \in \mathbb{Z}_m$. Typically in Machine
Learning function values are available for learning, that is, the $L_j$ are
point evaluation functionals. However, there are many practical problems where
such information is not readily available, for example tomography or EXAFS
spectroscopy, [15]. Alternatively, it may be of practical advantage to use “local”
averages of $f$ as observed information. This idea is investigated in [23, c. 8] in
the context of support vector machines. Perhaps even more compelling is the
question of what may be the “best” $m$ observations that should be made to
learn a function. For example, is it better to know function values or Fourier
coefficients of a periodic function? These and related questions are addressed in
[18] and lead us here to deal with linear functionals other than function values
for Machine Learning.

We are especially interested in the case when the samples $y_j$, $j \in \mathbb{Z}_m$, are
known to be noisy so that it is appropriate to estimate $f$ as the minimizer in
some Banach space of a regularization functional of the form

$$E(f) := \sum_{j \in \mathbb{Z}_m} Q(y_j, L_j(f)) + H(\|f\|) \tag{1.3}$$

where $H : \mathbb{R}_+ \to \mathbb{R}_+$ is a strictly increasing function, and $Q : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ is
some prescribed loss function. If the Banach space is a reproducing kernel Hilbert
space, the linear functionals $L_j$, $j \in \mathbb{Z}_m$, are chosen to be point evaluations. In
this case a minimizer of (1.3) has the form in equation (1.1), a fact which is
known as the representer theorem, see e.g. [22,25], which we generalize here to
any Banach space.

We note that the problem of minimizing a regularization functional of the
form (1.3) in a finite dimensional Banach space has been considered in the case of
support vector machines in [1] and in more general cases in [26]. Finite dimensional
Banach spaces have also been considered in the context of on-line learning, see
e.g. [9]. Learning in infinite dimensional Banach spaces has also been considered.
For example, [7] considers learning a univariate function in $L^p$ spaces, [2]
addresses learning in non-Hilbert spaces using point evaluation with kernels, and
[24,6] propose large margin algorithms in a metric input space by embedding
this space into certain Banach spaces of functions.
Since the functions (1.2) do not form a linear space as we vary $c \in \mathbb{R}^m$, we
may also enhance them by linear superposition to obtain functions of the form

$$\sum_{j \in \mathbb{Z}_n} a_j \, \phi\Big(\sum_{k \in \mathbb{Z}_m} c_{jk} g_k\Big) \tag{1.4}$$

where $\{a_j : j \in \mathbb{Z}_n\} \subset \mathbb{R}$ and $\{c_{jk} : j \in \mathbb{Z}_n, k \in \mathbb{Z}_m\} \subset \mathbb{R}$ are real-valued
parameters. This functional form has flexibility and simplicity. In particular,
when the functions $\{g_j : j \in \mathbb{Z}_{m+1}\}$ are chosen to be a basis for linear functions
on $\mathbb{R}^m$, (1.4) corresponds to feed-forward neural networks with one hidden layer,
see for example [12].
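To make the form (1.4) concrete, here is a small sketch that builds such a function from its ingredients. It is an illustration only: the helper name `superposition` is hypothetical, $\phi = \tanh$ is an arbitrary choice, and the two coordinate functions are the constant 1 and the identity on the real line, so each inner sum is a bias plus a weight times $x$.

```python
import math

def superposition(a, C, g, phi):
    """Build f(x) = sum_j a[j] * phi(sum_k C[j][k] * g[k](x)), the form (1.4)."""
    def f(x):
        feats = [gk(x) for gk in g]
        return sum(aj * phi(sum(cjk * fk for cjk, fk in zip(row, feats)))
                   for aj, row in zip(a, C))
    return f

# Assumed example: g_0 = 1 and g_1(x) = x span the affine functions on the line.
g = [lambda x: 1.0, lambda x: x]
a = [1.0, -1.0]                     # outer coefficients a_j
C = [[0.0, 1.0], [1.0, 1.0]]        # inner coefficients c_jk, one row per hidden unit
f = superposition(a, C, g, math.tanh)
```

With these parameters `f(x)` equals `tanh(x) - tanh(1 + x)`, which is a one-hidden-layer network with two units. Choosing `phi` as the identity and a single row recovers a linear combination of the form (1.1).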

In section 3 we address the problem of when functions of the form in equation 
(1.4) are dense in the space of continuous functions in the uniform norm. Finally, 
in section 4 we present some preliminary thoughts about the problem of choosing 
the parameters in (1.4) from prescribed linear constraints. 

2 Regularization and Minimal Norm Interpolation 

Let $X$ be a Banach space and $X^*$ its dual, that is, the space of bounded linear
functionals $L : X \to \mathbb{R}$ with the norm $\|L\| := \sup\{L(x) : \|x\| \le 1\}$. Given
a set of examples $\{(L_j, y_j) : j \in \mathbb{Z}_m\} \subset X^* \times \mathbb{R}$ and a prescribed function
$V : \mathbb{R}^m \times \mathbb{R}_+ \to \mathbb{R}$ which is strictly increasing in its last argument (for every
choice of its first argument) we consider the problem of minimizing the functional
$E : X \to \mathbb{R}$ defined for $x \in X$ as

$$E(x) := V\big((L_j(x) : j \in \mathbb{Z}_m), \|x\|\big) \tag{2.5}$$

over all elements $x$ in $X$ (here $V$ contains the information about the $y_j$). A
special case of this problem is covered by a functional of the form (1.3). Suppose
that $x_0$ is a solution to the above problem, set $y_j := L_j(x_0)$, $j \in \mathbb{Z}_m$, and let $x$ be
any element of $X$ such that $L_j(x) = y_j$, $j \in \mathbb{Z}_m$. By the definition of $x_0$
we have that

$$V(y, \|x_0\|) \le V(y, \|x\|)$$

and so, since $V$ is strictly increasing in its last argument,

$$\|x_0\| = \min\{\|x\| : L_j(x) = y_j, \, j \in \mathbb{Z}_m, \, x \in X\}. \tag{2.6}$$

This observation is the motivation for our study of problem (2.6) which is usually
called minimal norm interpolation. Note that this conclusion even holds when
$\|x\|$ is replaced by any functional of $x$.

We make no claim for originality in our ensuing remarks about this prob- 
lem which have been chosen to show the usefulness of the representation (1.2). 
Indeed, we are roaming over well-trodden ground. 

Thus, given data $\{y_j : j \in \mathbb{Z}_m\} \subset \mathbb{R} \setminus \{0\}$, we consider the minimum norm
interpolation (MNI) problem

$$\mu := \inf\{\|x\| : L_j(x) = y_j, \, j \in \mathbb{Z}_m, \, x \in X\}. \tag{2.7}$$
We always require in (2.7) that corresponding to the prescribed data $y := (y_j : j \in \mathbb{Z}_m)$
there is at least one $x \in X$ for which the linear constraints in (2.7) are
satisfied. In addition, we may assume that the linear functionals $\{L_j : j \in \mathbb{Z}_m\}$
are linearly independent. This means that whenever $a := (a_j : j \in \mathbb{Z}_m)$ is such
that $\sum_{j \in \mathbb{Z}_m} a_j L_j = 0$ then $a = 0$. Otherwise, we can “thin” the set of linear
functionals to a linearly independent set.

We say that the linear functional $L \in X^* \setminus \{0\}$ peaks at $x \in X \setminus \{0\}$, if $L(x) = \|L\| \|x\|$.
Let us also say that $x$ peaks at $L$, if $L$ peaks at $x$. A consequence of
the Hahn-Banach Theorem, see for example [21, p. 223], is that for every $x \in X$
there always exists an $L \in X^*$ which peaks at $x$ and so, $\|x\| = \max\{L(x) : \|L\| \le 1, L \in X^*\}$,
see [21, p. 226, Prop. 6]. On the other hand, the supremum in
the definition of $\|L\|$ is not always achieved, unless $L$ peaks at some $x \in X \setminus \{0\}$.
We also recall that $X$ is weakly compact if, for every norm bounded sequence
$\{x_n : n \in \mathbb{Z}_+\} \subset X$ there exists a weakly convergent subsequence $\{x'_n : n \in \mathbb{N}\}$,
that is, there is an $x \in X$ such that for every $L \in X^*$, $\lim_{n \to \infty} L(x'_n) = L(x)$.
When $X$ is weakly compact then for every $L \in X^*$ there is always an $x \in X$
which peaks at $L$. Recall that a Banach space $X$ is reflexive, that is, $(X^*)^* = X$,
if and only if $X$ is weakly compact, see [16, p. 127, Thm. 3.6], and it is known
that any weakly compact normed linear space always admits a minimal norm
interpolant.

If $M$ is a closed subspace of $X$, we define the distance of $x$ to $M$ as
$d(x, M) := \min\{\|x - t\| : t \in M\}$.

In particular, if we choose $M_0 := \{x : x \in X, L_j(x) = 0, j \in \mathbb{Z}_m\}$ and any
$w \in X$ such that $L_j(w) = y_j$, $j \in \mathbb{Z}_m$, then we have that

$$d(w, M_0) = \mu. \tag{2.8}$$



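Theorem 1. $x_0$ is a solution of (2.7) if and only if $L_j(x_0) = y_j$, $j \in \mathbb{Z}_m$, and
there exists $(c_j : j \in \mathbb{Z}_m) \in \mathbb{R}^m$ such that the linear functional $\sum_{j \in \mathbb{Z}_m} c_j L_j$
peaks at $x_0$.

Proof. We choose in (2.8) $w = x_0$ so that $L_j(x_0) = y_j$, $j \in \mathbb{Z}_m$, and $\|x_0\| = d(x_0, M_0)$.
Using the basic duality principle for the distance (2.8), see for example [8],
we conclude that

$$\|x_0\| = \max\{L(x_0) : L(x) = 0, \, x \in M_0, \, \|L\| \le 1\}. \tag{2.9}$$

However, $L$ vanishes on $M_0$ if and only if there exists $(c_j : j \in \mathbb{Z}_m) \in \mathbb{R}^m$ such
that $L = \sum_{j \in \mathbb{Z}_m} c_j L_j$. Hence, by (2.9) there is such an $L$ which peaks at $x_0$.

On the other hand, if for some $(c_j : j \in \mathbb{Z}_m) \in \mathbb{R}^m$ the linear functional
$\sum_{j \in \mathbb{Z}_m} c_j L_j$ peaks at $x_0$ with $L_j(x_0) = y_j$, $j \in \mathbb{Z}_m$, we have, for every $t \in M_0$,
that

$$\Big\|\sum_{j \in \mathbb{Z}_m} c_j L_j\Big\| \, \|x_0\| = \sum_{j \in \mathbb{Z}_m} c_j L_j(x_0 + t) \le \Big\|\sum_{j \in \mathbb{Z}_m} c_j L_j\Big\| \, \|x_0 + t\|$$

and so, $x_0$ is a minimal norm interpolant. □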
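This theorem tells us that if $x_0$ solves the MNI problem then there exists
$\{c_j : j \in \mathbb{Z}_m\} \subset \mathbb{R}$ such that $\|x_0\| \, \big\|\sum_{j \in \mathbb{Z}_m} c_j L_j\big\| = \sum_{j \in \mathbb{Z}_m} c_j L_j(x_0) = \sum_{j \in \mathbb{Z}_m} c_j y_j$.
How do we find the parameters $\{c_j : j \in \mathbb{Z}_m\}$? This is described next.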

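Theorem 2. If $X$ is a Banach space then

$$\min\Big\{\Big\|\sum_{j \in \mathbb{Z}_m} c_j L_j\Big\| : \sum_{j \in \mathbb{Z}_m} c_j y_j = 1\Big\} = 1/\mu. \tag{2.10}$$

In addition, if $X$ is weakly compact and $c$ is the solution to (2.10) then there
exists $x \in X$ such that $\|x\| = 1$, $L_j(x) = y_j / \mu$, $j \in \mathbb{Z}_m$, and
$\sum_{j \in \mathbb{Z}_m} c_j L_j(x) = \big\|\sum_{j \in \mathbb{Z}_m} c_j L_j\big\|$.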

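Proof. Since the function $H : \mathbb{R}^m \to \mathbb{R}_+$ defined for each $c = (c_j : j \in \mathbb{Z}_m)$ by
$H(c) := \big\|\sum_{j \in \mathbb{Z}_m} c_j L_j\big\|$ is continuous, homogeneous and nonzero for $c \neq 0$, it
tends to infinity as $c \to \infty$, so the minimum in (2.10) exists. The proof of (2.10)
is transparent from our remarks in Theorem 1. Indeed, for every $w \in X$ such
that $L_j(w) = y_j$, $j \in \mathbb{Z}_m$, we have that $\mu = d(w, M_0)$ and

$$\mu = \max\{L(w) : L(x) = 0, \, x \in M_0, \, \|L\| \le 1\}.$$

Moreover, since $L$ vanishes on $M_0$ if and only if $L = \sum_{j \in \mathbb{Z}_m} c_j L_j$ for some
$c = (c_j : j \in \mathbb{Z}_m)$, the right hand side of this equation becomes

$$\max\Big\{\sum_{j \in \mathbb{Z}_m} c_j y_j : \Big\|\sum_{j \in \mathbb{Z}_m} c_j L_j\Big\| \le 1\Big\} = \Big(\min\Big\{\Big\|\sum_{j \in \mathbb{Z}_m} c_j L_j\Big\| : \sum_{j \in \mathbb{Z}_m} c_j y_j = 1\Big\}\Big)^{-1}$$

from which equation (2.10) follows.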

For vectors $c = (c_j : j \in \mathbb{Z}_m)$, $d = (d_j : j \in \mathbb{Z}_m)$ in $\mathbb{R}^m$, we let
$c \cdot d = \sum_{j \in \mathbb{Z}_m} c_j d_j$, the standard inner product on $\mathbb{R}^m$. Let $c := (c_j : j \in \mathbb{Z}_m)$ be a
solution to the minimization problem (2.10) and consider the linear functional

$$L := \sum_{j \in \mathbb{Z}_m} c_j L_j.$$

This solution is characterized by the fact that the right directional derivative of
the function $H$ at $c$ along any vector $a = (a_j : j \in \mathbb{Z}_m)$ perpendicular to $y$ is
nonnegative. That is, we have that

$$H'(c; a) := \lim_{\lambda \to 0^+} \frac{H(c + \lambda a) - H(c)}{\lambda} \ge 0 \tag{2.11}$$

when $a \cdot y = 0$. This derivative can be computed to be

$$H'(c; a) = \max\Big\{\sum_{j \in \mathbb{Z}_m} a_j L_j(x) : \|x\| \le 1\Big\}, \tag{2.12}$$
see [13]. We introduce the convex and compact set $C := \{(L_j(x) : j \in \mathbb{Z}_m) : \|x\| \le 1\} \subset \mathbb{R}^m$.
If $a$ is perpendicular to $y$ then, by the inequality (2.11) and
the formula (2.12), we have that

$$\max\{a \cdot u : u \in C\} \ge 0. \tag{2.13}$$

We shall now prove that the line $\ell := \{\lambda y : \lambda \in \mathbb{R}\}$ intersects $C$. Suppose to
the contrary that it does not. So, there exists a hyperplane $\{z : u \cdot z + \tau = 0\}$
where $u \in \mathbb{R}^m$ and $\tau \in \mathbb{R}$, which separates these sets, that is

(i) $u \cdot z + \tau > 0$, $z \in \ell$, (ii) $u \cdot z + \tau \le 0$, $z \in C$,

see [21]. From condition (i) we conclude that $u$ is perpendicular to $y$ and $\tau > 0$
while (ii) implies that $\max\{u \cdot z : z \in C\} < 0$. This is in contradiction to (2.13).
Hence, there is an $x$ such that $L_j(x) = y_j / \mu$, $j \in \mathbb{Z}_m$, $L(x) = \|L\|$ and $\|x\| = 1$.
Therefore, it must be that $x_0 := \mu x$ is a minimal norm solution (MNS). □

This theorem leads us to a method to identify the MNS in a reflexive smooth
Banach space $X$. Recall that a reflexive Banach space $X$ is smooth provided that
for every $L \in X^* \setminus \{0\}$ there is a unique $x_L \in X$ which peaks at $L$.

Corollary 1. If $X$ is a smooth reflexive Banach space, $L := \sum_{j \in \mathbb{Z}_m} c_j L_j$ is the
solution to (2.10) and $L$ peaks at $x_L$ with $\|x_L\| = 1$ then $x_0 := \mu x_L$ is the unique
solution to (2.7) and $\mu = 1 / \|L\|$.

We wish to note some important examples of the above results. The first to
consider is naturally a Hilbert space $X$. In this case $X$ is reflexive and $X^*$ can be
identified with $X$, that is, for each $L_j \in X^*$, there is a unique $x_j \in X$ such that
$L_j(x) = (x_j, x)$, $x \in X$. Thus, $\hat{x} = \sum_{j \in \mathbb{Z}_m} c_j x_j$ solves the dual problem when
$(x_j, \hat{x}) = \lambda y_j$, $j \in \mathbb{Z}_m$, $\lambda = \|\hat{x}\|^2$, and $x_0 = \hat{x} / \|\hat{x}\|^2$ is the minimal norm solution.
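In a finite dimensional Hilbert space the whole construction reduces to one linear solve with the Gram matrix. The sketch below works in $\mathbb{R}^3$ with the standard inner product and is only an illustration: the helper names `dot`, `solve` and `minimal_norm_interpolant` are hypothetical, and the vectors in `reps` are assumed to be the linearly independent representers $x_j$ of the functionals $L_j$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def minimal_norm_interpolant(reps, y):
    """Return (x0, x_hat): the minimal norm interpolant and the dual solution."""
    gram = [[dot(u, v) for v in reps] for u in reps]
    b = solve(gram, y)                   # x0 = sum_j b_j x_j interpolates the data
    scale = dot(b, y)                    # equals ||x0||^2
    c = [bj / scale for bj in b]         # rescaled so that c . y = 1
    combine = lambda w: [sum(wj * xj[i] for wj, xj in zip(w, reps))
                         for i in range(len(reps[0]))]
    return combine(b), combine(c)

reps = [[1.0, 0.0, 1.0], [0.0, 2.0, 1.0]]    # representers x_j of the functionals L_j
y = [1.0, 3.0]
x0, x_hat = minimal_norm_interpolant(reps, y)
```

Here `x_hat` plays the role of the dual solution normalized by $c \cdot y = 1$, and `x0` is recovered from it by dividing by the squared norm of `x_hat`. Adding to `x0` any vector orthogonal to both representers keeps the constraints but increases the norm.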

The Hilbert space case does not show the value of function composition
appearing in (1.2). A better place to reveal this is in the context of Orlicz spaces.
The theory of such spaces is discussed in several books, see e.g. [17,20], and
minimal norm interpolation is studied in [3]. We review these ideas in the context of
Corollary 1. Let $\omega : [0, \infty) \to [0, \infty)$ be a convex and continuously differentiable
function on $[0, \infty)$ such that $\lim_{s \to \infty} \omega'(s) = \infty$ and $\omega(0) = \omega'_+(0) = 0$ where
$\omega'_+$ is the right derivative of $\omega$. Such a function is sometimes known as a Young
function. We will also assume that the function $s \mapsto s \omega'(s) / \omega(s)$, $s \in [0, \infty)$, is
bounded on $[k, \infty)$ for some $k \in [0, \infty)$. Let $(D, \mathcal{B}, \mu)$ be a finite measure space,
see [21, p. 286], $L^0(\mu)$ the space of measurable functions $f : D \to \mathbb{R}$, and denote
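by $\mathcal{L}_\omega$ the convex hull of the set

$$\Big\{f \in L^0(\mu) : \int_D \omega(|f(t)|) \, d\mu(t) < \infty\Big\},$$

equipped with the norm

$$\|f\|_\omega := \inf\Big\{\lambda > 0 : \int_D \omega\Big(\frac{|f(t)|}{\lambda}\Big) \, d\mu(t) \le 1\Big\}.$$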
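The dual of $\mathcal{L}_\omega$ is the space $\mathcal{L}_{\omega^*}$ where $\omega^*$ is the complementary function of $\omega$
which is given by the formula

$$\omega^*(s) = \sup\{st - \omega(t) : t \in [0, \infty)\}, \quad s \in [0, \infty).$$

For every $f \in \mathcal{L}_\omega$ and $g \in \mathcal{L}_{\omega^*}$ there also holds the Orlicz inequality

$$|(f, g)| \le \|f\|_\omega \|g\|_{\omega^*}$$

where we have defined $(f, g) := \int_D f(t) g(t) \, d\mu(t)$. The Orlicz inequality becomes
an equality if and only if

$$f = \lambda \, (\omega^*)'(|g|) \, \mathrm{sign}(g), \tag{2.14}$$

for some $\lambda \in \mathbb{R}$. This means that the linear functional represented by $g \in \mathcal{L}_{\omega^*}$
peaks at $f$ if and only if $f$ satisfies equation (2.14). Moreover, under the above
conditions on $\omega$, $\mathcal{L}_\omega$ is reflexive and smooth. Thus the hypothesis of Corollary
1 is satisfied and we conclude that the unique solution to (2.7) is given by
$f = \lambda \, \phi_{\omega^*}\big(\sum_{j \in \mathbb{Z}_m} c_j g_j\big)$ where $\phi_{\omega^*}$ is defined for $t \in \mathbb{R}$ as

$$\phi_{\omega^*}(t) = (\omega^*)'(|t|) \, \mathrm{sign}(t) \tag{2.15}$$

and the coefficients $\lambda$, $c_j$, $j \in \mathbb{Z}_m$, solve the system of nonlinear equations
$(f, g_j) = y_j$, $j \in \mathbb{Z}_m$.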

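As a special case consider the choice $\omega(s) = s^p / p$, $p > 1$, $s \in [0, \infty)$. In
this case $\mathcal{L}_\omega = L^p$, the space of functions whose $p$-th power is integrable, and the
dual space is $L^q$ where $1/p + 1/q = 1$, [21]. Since $\omega^*(s) = s^q / q$, $s \in [0, \infty)$, the
solution to equations (2.5) and (2.7) has the form $f = \lambda \, \phi_q\big(\sum_{j \in \mathbb{Z}_m} c_j g_j\big)$ where
for all $t \in \mathbb{R}$, $\phi_q$ is defined by the equation

$$\phi_q(t) := |t|^{q-1} \, \mathrm{sign}(t). \tag{2.16}$$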
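The peaking property behind (2.16) is the equality case of Hölder's inequality, which is easy to verify on a discrete measure space. The following sketch is illustrative: the three-atom measure `weights`, the vector `g` and the helper names `phi_q` and `norm` are assumptions made for the example.

```python
import math

def phi_q(t, q):
    """phi_q(t) = |t|^(q-1) sign(t), as in (2.16)."""
    return math.copysign(abs(t) ** (q - 1), t)

def norm(v, p, weights):
    """Discrete L^p norm: (sum_i w_i |v_i|^p)^(1/p)."""
    return sum(w * abs(x) ** p for w, x in zip(weights, v)) ** (1.0 / p)

p = 3.0
q = p / (p - 1.0)                    # conjugate exponent, 1/p + 1/q = 1
weights = [0.5, 0.25, 0.25]          # a finite measure on three atoms
g = [1.0, -2.0, 0.5]                 # represents the functional f -> (f, g)
f = [phi_q(t, q) for t in g]         # candidate element at which g peaks
pairing = sum(w * a * b for w, a, b in zip(weights, f, g))
```

Since $(q - 1)p = q$, the pairing $(f, g)$ equals $\|f\|_p \|g\|_q$ for this choice of $f$, while a generic element such as the constant function gives a strict inequality.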

3 Learning All Continuous Functions: Density

An important feature of any learning algorithm is its ability to enhance accuracy
by increasing the number of parameters in the model. Below we present a
sufficient condition on the functions $\phi$ and $\{g_j : j \in \mathbb{Z}_m\}$ so that the functions in
(1.4) can approximate any continuous real-valued function within any given
tolerance on a compact set $D \subset \mathbb{R}^d$. For related material see [19]. Let us formulate
our observation.

We use $C(D)$ for the space of all continuous functions on the set $D$ and for
any $f \in C(D)$ we set $\|f\|_D := \max\{|f(x)| : x \in D\}$. For any subset $T$ of $C(D)$
we use $\mathrm{span}(T)$ to denote the smallest closed linear subspace of $C(D)$ containing
$T$. We enumerate vectors in $\mathbb{R}^m$ by superscripts and use $g := (g_j : j \in \mathbb{Z}_m)$
for the vector-valued map $g : D \to \mathbb{R}^m$ whose coordinates are built from the
functions in $\mathcal{G} := \{g_j : j \in \mathbb{Z}_m\}$. This allows us to write the functions in (1.4) as

$$\sum_{j \in \mathbb{Z}_n} a_j \, \phi(c^j \cdot g). \tag{3.17}$$
For any two subsets $A$ and $B$ of $C(D)$ we use $A \cdot B$ for the set defined by
$A \cdot B := \{fg : f \in A, g \in B\}$ and, for every $k \in \mathbb{N}$, $A^k$ denotes the set
$\{f^k : f \in A\}$. Given any $\phi \in C(\mathbb{R})$ we let $\mathcal{M}(\phi)$ be the smallest closed linear
subspace containing all the functions (3.17). Note that $m$ is fixed while $\mathcal{M}(\phi)$
contains all the functions (3.17) for any $n$. We use $\mathcal{A}_{\mathcal{G}}$ for the smallest subalgebra
in $C(D)$ which contains $\mathcal{G}$, that is, the direct sum $\bigoplus_{k \in \mathbb{N}} \mathcal{G}^k$. We seek conditions
on $\phi$ and $g$ so that $\mathcal{M}(\phi) = C(D)$ and we prepare for our observation with two
lemmas.

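Lemma 1. If $\phi \in C(\mathbb{R}) \setminus \{0\}$ and $1 \in \mathrm{span}(\mathcal{G})$ then $1 \in \mathcal{M}(\phi)$.

Proof. By hypothesis, there is a $t \in \mathbb{R}$ such that $\phi(t) \neq 0$ and a $c \in \mathbb{R}^m$ such
that $c \cdot g = t$. Hence we have that $1 = \phi(t)^{-1} \phi(c \cdot g) \in \mathcal{M}(\phi)$. □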

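Lemma 2. If $\phi' \in C(\mathbb{R})$ then $\mathcal{M}(\phi') \cdot \mathcal{G} \subseteq \mathcal{M}(\phi)$.

Proof. We choose any function $f$ of the form

$$f = \sum_{j \in \mathbb{Z}_n} a_j \, \phi'(c^j \cdot g)$$

where $a = (a_j : j \in \mathbb{Z}_n) \in \mathbb{R}^n$ and $\{c^j : j \in \mathbb{Z}_n\} \subset \mathbb{R}^m$. For any $d \in \mathbb{R}^m$ we
define the function $q = d \cdot g$. Let us show that $f \cdot q \in \mathcal{M}(\phi)$. To this end, we
define for $t \in \mathbb{R}$ the function

$$h_t := \sum_{j \in \mathbb{Z}_n} a_j \, \phi\big((c^j + t d) \cdot g\big)$$

and observe that $\lim_{t \to 0} t^{-1} (h_t - h_0) = f \cdot q$. Since $\{h_t - h_0 : t \in \mathbb{R}\} \subset \mathcal{M}(\phi)$
the result follows. □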
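The limit used in the proof of Lemma 2 can be observed numerically. The sketch below is a check under assumed choices, none of which come from the paper: $\phi = \tanh$, the feature map $g(x) = (1, x)$ on $D = [-1, 1]$, and arbitrary coefficient values. All names are hypothetical.

```python
import math

phi = math.tanh
dphi = lambda s: 1.0 - math.tanh(s) ** 2       # derivative of tanh

g = lambda x: (1.0, x)                          # assumed feature map g : D -> R^2
dotp = lambda u, v: sum(a * b for a, b in zip(u, v))

a = [0.7, -1.3]                                 # outer coefficients a_j
cs = [(0.2, 1.0), (-0.5, 2.0)]                  # inner vectors c^j
dvec = (0.3, -0.4)                              # direction d defining q = d . g

def h(t, x):
    """h_t(x) = sum_j a_j phi((c^j + t d) . g(x))."""
    return sum(aj * phi(dotp([c + t * e for c, e in zip(cj, dvec)], g(x)))
               for aj, cj in zip(a, cs))

def f_times_q(x):
    """(f q)(x) with f = sum_j a_j phi'(c^j . g) and q = d . g."""
    f = sum(aj * dphi(dotp(cj, g(x))) for aj, cj in zip(a, cs))
    return f * dotp(dvec, g(x))

def quotient_error(t, xs):
    """Maximum over the sample of |(h_t - h_0)/t - f q|."""
    return max(abs((h(t, x) - h(0.0, x)) / t - f_times_q(x)) for x in xs)

xs = [i / 10 - 1.0 for i in range(21)]          # sample of D = [-1, 1]
```

Evaluating `quotient_error` for decreasing `t` shows the difference quotient approaching $f \cdot q$ on the sample, at a rate roughly proportional to `t`.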

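We say that $\mathcal{G}$ separates points on $D$ when the map $g : D \to \mathbb{R}^m$ is injective.
Recall that an algebra $\mathcal{A} \subseteq C(D)$ separates points provided for each pair of
distinct points $x$ and $y \in D$ there is an $f \in \mathcal{A}$ such that $f(x) \neq f(y)$.

Theorem 3. If $\phi \in C^\infty(\mathbb{R})$, $\phi$ is not a polynomial, $1 \in \mathrm{span}(\mathcal{G})$ and $\mathcal{G}$ separates
points then $\mathcal{M}(\phi) = C(D)$.

Proof. Our hypothesis implies that $\mathcal{A}_{\mathcal{G}}$ separates points and contains constants.
Hence, the Stone-Weierstrass Theorem, see for example [21], implies that the
algebra $\mathcal{A}_{\mathcal{G}}$ is dense in $C(D)$. Thus, the result will follow as soon as we show that
$\mathcal{A}_{\mathcal{G}} \subseteq \mathcal{M}(\phi)$. Since $\phi \in C^\infty(\mathbb{R})$, Lemma 2 implies for any positive integer $k$ that

$$\mathcal{M}(\phi^{(k)}) \cdot \mathcal{G}^k \subseteq \mathcal{M}(\phi).$$

Using Lemma 1 and the fact that $\phi$ is not a polynomial the above inclusion
implies that $\mathcal{G}^k \subseteq \mathcal{M}(\phi)$. Consequently, we conclude that

$$\mathcal{A}_{\mathcal{G}} = \bigoplus_{k \in \mathbb{N}} \mathcal{G}^k \subseteq \mathcal{M}(\phi).$$

□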
We remark that the idea for the proof of Lemma 2 is borrowed from [4] where
only the case that $\mathrm{span}(\mathcal{G})$ is the linear functions on $\mathbb{R}^m$ and $D$ is a subset of $\mathbb{R}^m$
is treated. We also recommend [12] for a Fourier analysis approach to density
and [10] which may allow for the removal of our hypothesis that $\phi \in C^\infty(\mathbb{R})$.

In Theorem 3 above $m$ is fixed and we enhance approximation of an arbitrary
function by functions of the special type (1.4) by adjusting $n$. Next, we provide
another density result where $m$ is allowed to vary, but in this case, $g$ is chosen in
a specific fashion from the reproducing kernel of a Hilbert space $\mathcal{H}$ of real-valued
functions on $D$ contained in $C(D)$. Indeed, let $K$ be the reproducing kernel for $\mathcal{H}$
which is jointly continuous on $D \times D$. There are useful cases when $\mathcal{H}$ is endowed
with a semi-norm, that is, there are nontrivial functions in $\mathcal{H}$ with norm zero,
see e.g. [25]. To ensure that these cases are covered by our results below we specify
a finite number of functions $\{k_j : j \in \mathbb{Z}_r\}$ and consider functions of the form

$$\sum_{j \in \mathbb{Z}_m} c_j K(\cdot, x_j) + \sum_{j \in \mathbb{Z}_r} c_{j+m} k_j. \tag{3.18}$$

We use $\mathcal{K}$ for the smallest closed linear subspace of $C(D)$ which contains all the
functions in (3.18) for any $m$ and $c = (c_j : j \in \mathbb{Z}_{m+r}) \in \mathbb{R}^{m+r}$. Here the samples
$D_m := \{x_j : j \in \mathbb{Z}_m\}$ are chosen in $D$ and, in the spirit of our previous discussion,
we compose the function in (3.18) with a function $\phi$ to obtain functions of the
form

$$\phi\Big(\sum_{j \in \mathbb{Z}_m} c_j K(\cdot, x_j) + \sum_{j \in \mathbb{Z}_r} c_{j+m} k_j\Big).$$

We write this function as $\phi(c \cdot w)$ where $c \in \mathbb{R}^{m+r}$ and the coordinates of
the vector map $w : D \to \mathbb{R}^{m+r}$ are defined as $w_j = K(\cdot, x_j)$, $j \in \mathbb{Z}_m$, and
$w_{j+m} = k_j$, $j \in \mathbb{Z}_r$. We let $\mathcal{K}(\phi)$ be the smallest closed linear subspace containing
all these functions. Our next result provides a sufficient condition on $\phi$ and $w$
such that $\mathcal{K}(\phi)$ is dense in $C(D)$. To this end we write $K$ in the “Mercer form”

$$K(x, y) = \sum_{\ell \in \mathbb{Z}_+} \lambda_\ell \, \phi_\ell(x) \phi_\ell(y), \quad x, y \in D \tag{3.19}$$

where we may as well assume that $\lambda_\ell \neq 0$ for all $\ell \in \mathbb{Z}_+$. Here, we demand that
$\{\phi_\ell : \ell \in \mathbb{Z}_+\} \subset C(D)$ and we require the series above converges uniformly on
$D \times D$. We also require that the set $J = \{\ell : \lambda_\ell < 0\}$ has the property that

$$\{\phi_\ell : \ell \in J\} \subset \mathrm{span}\{k_j : j \in \mathbb{Z}_r\} \tag{3.20}$$

and that $\mathcal{U} := \mathrm{span}\{\phi_\ell : \ell \in \mathbb{Z}_+\} = C(D)$. When these conditions hold we call
$K$ acceptable.

Theorem 4. If $K$ is acceptable, $1 \in \mathcal{K}(\phi')$ and $\phi' \in C(\mathbb{R}) \setminus \{0\}$ then $\mathcal{K}(\phi) = C(D)$.

Proof. We establish this fact by showing that there is no nontrivial linear
functional $L$ which has the property that

$$L(g) = 0 \tag{3.21}$$
for every $g \in \mathcal{K}(\phi)$, see for example [21]. Let $c$ and $w$ be as above. We choose
$b \in \mathbb{R}$, $y \in D$ and $g = \phi(c \cdot w + b K(\cdot, y))$. Now, differentiate both sides of
equation (3.21) with respect to $b$ and evaluate the resulting equation at $b = 0$ to
obtain the equation

$$L(\phi'(c \cdot w) K(\cdot, y)) = 0, \quad y \in D. \tag{3.22}$$

On the other hand, differentiating (3.21) with respect to $c_{j+m}$, $j \in \mathbb{Z}_r$, gives the
equation

$$L(\phi'(c \cdot w) k_\ell) = 0, \quad \ell \in \mathbb{Z}_r. \tag{3.23}$$

We shall use these equations in a moment. First, we observe that by hypothesis
there exists a $t \in \mathbb{R}$ such that $\phi'(t) \neq 0$ and for every $\epsilon > 0$ there exists $f \in \mathcal{K}(\phi')$
given, for some $n \in \mathbb{N}$, $\{a_j : j \in \mathbb{Z}_n\} \subset \mathbb{R}$, $\{d^j : j \in \mathbb{Z}_n\} \subset \mathbb{R}^{m+r}$, by the formula

$$f = \sum_{j \in \mathbb{Z}_n} a_j \, \phi'(d^j \cdot w) \tag{3.24}$$

such that $|\phi'(t) - f| \le \epsilon$ on $D$. We now evaluate the equations (3.22) and (3.23)
at $c = d^j$, $j \in \mathbb{Z}_n$, and combine the resulting equations to obtain

$$L(f K(\cdot, y)) = 0, \quad y \in D, \qquad L(f k_\ell) = 0, \quad \ell \in \mathbb{Z}_r.$$

We let $M$ be a constant chosen big enough so that for all $x$ and $y \in D$, $|K(x, y)| \le M$,
and $|k_\ell(x)| \le M$, $\ell \in \mathbb{Z}_r$. We rewrite the first of these equations in the form

$$0 = L((f - \phi'(t)) K(\cdot, y)) + \phi'(t) L(K(\cdot, y))$$

and treat the second in the same way, from which we obtain the inequalities

$$|\phi'(t) L(K(\cdot, y))| \le \epsilon \|L\| M, \quad y \in D, \qquad |\phi'(t) L(k_\ell)| \le \epsilon \|L\| M, \quad \ell \in \mathbb{Z}_r.$$

Since $\epsilon$ is arbitrary we conclude that $L(K(\cdot, y)) = 0$, $y \in D$, and
$L(k_\ell) = 0$, $\ell \in \mathbb{Z}_r$. Thus, using the Mercer representation for $K$ we conclude,
for all $y \in D$, that

$$\sum_{j \in \mathbb{Z}_+} \lambda_j L(\phi_j) \phi_j(y) = 0. \tag{3.25}$$

Next, we apply $L$ to both sides of (3.25) and obtain that $\sum_{j \in \mathbb{Z}_+} \lambda_j |L(\phi_j)|^2 = 0$
which implies that $L(\phi_j) = 0$, $j \in \mathbb{Z}_+$. However, since $\mathrm{span}\{\phi_j : j \in \mathbb{Z}_+\} = C(D)$,
it follows that $L = 0$, which proves the result. □

We remark that the proof of this theorem yields for any $f \in C(D)$ the fact that

$$d(f, \mathcal{K}(\phi)) \le d(f, \mathcal{K}) = d(f, \mathcal{U}).$$

Note that if $\phi(t) = t$ the hypothesis that $1 \in \mathcal{K}(\phi')$ is automatically satisfied.
We provide another sufficient condition for this requirement to hold.

Lemma 3. If $1 \in \mathcal{K}$ and $\phi \in C(\mathbb{R}) \setminus \{0\}$ then $1 \in \mathcal{K}(\phi)$.
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Proof. We choose some t G IR such that ^ 0 and some e > 0. There is a 
(5 > 0 such that whenever |t — s| < e, s G IR it follows that — </<(s)| < e. 
Since 1 G /C, there is a d G IR’”'*’’' and Dm CV so that \t — d-w\ < S uniformly 
on D. Hence it follows that — <f)(d • w)| < e uniformly on D which proves 
the result. □ 

As an example of the theorem above we choose $D = [-\pi, \pi]^d$, $d \in \mathbb{N}$, $\phi(t) =
t$, $t \in \mathbb{R}$, $K$ a $2\pi$-periodic translation kernel, that is, $K(x,y) = h(x - y)$, $x, y \in
D$, where $h : [-\pi,\pi]^d \to \mathbb{R}$ is even, continuous, and $2\pi$-periodic, and $r = 0$. To
ensure that $K$ is a reproducing kernel we assume $h$ has a uniformly convergent
Fourier series,

$$h(x) = \sum_{n \in \mathbb{Z}_+^d} a_n \cos(n \cdot x), \quad x \in \mathbb{R}^d \qquad (3.26)$$

where $a_n \ge 0$, $n \in \mathbb{Z}_+^d$. In this case we have the Mercer representation for $K$

$$K(x,y) = \sum_{n \in \mathbb{Z}_+^d} a_n \sin(n \cdot x) \sin(n \cdot y) + a_n \cos(n \cdot x) \cos(n \cdot y), \quad x, y \in \mathbb{R}^d.$$

In addition, if $a_n > 0$ for all $n \in \mathbb{Z}_+^d$, the functions appearing in this
representation are dense in the $2\pi$-periodic functions in $C(D)$, and we conclude
that $\mathcal{K}$ is dense in $C(D)$ as well.
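The Mercer representation of a translation kernel rests on the identity $\cos(n(x-y)) = \cos(nx)\cos(ny) + \sin(nx)\sin(ny)$. The following is a minimal numerical sketch of that identity in one dimension ($d = 1$) with a finite number of Fourier coefficients; the function names and the coefficient values are illustrative assumptions, not part of the paper.

```python
import math

def translation_kernel(x, y, coeffs):
    """K(x, y) = h(x - y) with h(t) = sum_n a_n cos(n t); coeffs[n] = a_n (assumed finite)."""
    return sum(a * math.cos(n * (x - y)) for n, a in enumerate(coeffs))

def mercer_expansion(x, y, coeffs):
    """The same kernel written as a sum of products of sines and cosines."""
    return sum(
        a * (math.sin(n * x) * math.sin(n * y) + math.cos(n * x) * math.cos(n * y))
        for n, a in enumerate(coeffs)
    )

# Hypothetical nonnegative Fourier coefficients a_0, a_1, a_2.
coeffs = [1.0, 0.5, 0.25]
points = [(-3.0, 2.0), (0.3, 0.3), (1.2, -0.7)]
max_gap = max(
    abs(translation_kernel(x, y, coeffs) - mercer_expansion(x, y, coeffs))
    for x, y in points
)
```

Both expressions agree up to floating-point rounding, which is the one-dimensional, truncated version of the representation stated above.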

We remark that the method of proof of Theorem 4 can be extended to other
function spaces, for instance $L^p$ spaces. This would require that (3.19) holds
relative to the convergence in that space and that the span of the functions
$\{\phi_n : n \in \mathbb{Z}_+\}$ is dense in it.

4 Learning Any Set of Finite Data: Interpolation 

In this section we discuss the possibility of adjusting the parameters in our model 
(1.4) to satisfy some prescribed linear constraints. This is a complex issue as it 
leads to the problem of solving nonlinear equations. Our observations, although 
incomplete, provide some instances in which this may be accomplished as well as 
an algorithm which may be useful to accomplish this goal. Let us first describe 
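our setup. We start with the function

$$f := \sum_{j \in \mathbb{Z}_n} a_j \phi(c^j \cdot g)$$

where $\{a_j : j \in \mathbb{Z}_n\} \subset \mathbb{R}$ and $\{c^j : j \in \mathbb{Z}_n\} \subset \mathbb{R}^m$ are to be specified by
some linear constraint. The totality of scalar parameters in this representation
is $n(m+1)$. To use these parameters we suppose there are available data vectors
$\{y^j : j \in \mathbb{Z}_n\} \subset \mathbb{R}^m$ and linear operators $L^s : C(D) \to \mathbb{R}^m$, $s \in \mathbb{Z}_n$, that lead to
the nonlinear equations

$$\sum_{j \in \mathbb{Z}_n} a_j L^s(\phi(c^j \cdot g)) = y^s, \quad s \in \mathbb{Z}_n. \qquad (4.27)$$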
There are $mn$ scalar equations here and the remaining degrees of freedom will
be used to specify the Euclidean norm of the vectors $c^j$, $j \in \mathbb{Z}_n$. We shall explain
this in a moment. It is convenient to introduce for each $s \in \mathbb{Z}_n$ the operator
$B_s : \mathbb{R}^m \to \mathbb{R}^m$ defined for any $c \in \mathbb{R}^m$ by the equation

$$B_s(c) = L^s(\phi(c \cdot g)), \quad s \in \mathbb{Z}_n. \qquad (4.28)$$

Therefore, the equations (4.27) take the form

$$\sum_{j \in \mathbb{Z}_n} a_j B_s(c^j) = y^s, \quad s \in \mathbb{Z}_n. \qquad (4.29)$$

Our first result covers the case n = 1. 

Theorem 5. If $\phi \in C(\mathbb{R})$ is an odd function and $B_0$ only vanishes on $\mathbb{R}^m$ at
0 then for any $y^0 \in \mathbb{R}^m$ and $r_0 > 0$ there is a $c^0 \in \mathbb{R}^m$ with $c^0 \cdot c^0 = r_0$ and
$a_0 \in \mathbb{R}$ such that $a_0 B_0(c^0) = y^0$.

Proof. We choose linearly independent vectors $\{w^j : j \in \mathbb{Z}_{m-1}\} \subset \mathbb{R}^m$ perpendicular
to $y^0$ and construct the map $H : \mathbb{R}^m \to \mathbb{R}^{m-1}$ by setting for $c \in \mathbb{R}^m$

$$H(c) := (w^j \cdot B_0(c) : j \in \mathbb{Z}_{m-1}).$$

We restrict $H$ to the sphere $c \cdot c = r_0$. Since $H$ is an odd continuous map, by the
Borsuk antipodal mapping theorem, see for example [11], there is a $c^0 \in \mathbb{R}^m$
with $c^0 \cdot c^0 = r_0$ such that $H(c^0) = 0$. Hence, $B_0(c^0) = u y^0$ for some scalar
$u \in \mathbb{R}$. Since $B_0$ vanishes only at the origin we have that $u \ne 0$ and so setting
$a_0 = u^{-1}$ proves the result. □

We remark that the above theorem extends our observation in (2.16). Indeed,
if we choose $\phi := \phi_q$ and use the linear operator $L^0 : C(D) \to \mathbb{R}^m$ defined for each
$f \in C(D)$ as $L^0(f) := ((f, g_j) : j \in \mathbb{Z}_m)$ then the above result reduces to (2.16).
However, note that Theorem 5 even in this special case is not proven by the
analysis of a variational problem.

We use Theorem 5 to propose an iterative method to solve the system of
equations (4.29). We begin with an initial guess $a^0 = (a_j^0 : j \in \mathbb{Z}_n)$ and vectors
$\{c^{j,0} : j \in \mathbb{Z}_n\}$ with $c^{j,0} \cdot c^{j,0} = r_j$, $j \in \mathbb{Z}_n$. We now update these parameters
by explaining how to construct $a^1 = (a_j^1 : j \in \mathbb{Z}_n)$ and vectors $\{c^{j,1} : j \in \mathbb{Z}_n\}$.
First, we define $a_0^1$ and $c^{0,1}$ by solving the equation

$$a_0^1 B_0(c^{0,1}) + \sum_{j \in \mathbb{Z}_{n-1}} a_{j+1}^0 B_0(c^{j+1,0}) = y^0,$$

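whose solution is assured by Theorem 5. Now, suppose we have found
$a_0^1, \ldots, a_{r-1}^1$ and $c^{0,1}, \ldots, c^{r-1,1}$ for some integer $1 \le r \le n-1$. We then solve
the equation

$$\sum_{j \in \mathbb{Z}_{r+1}} a_j^1 B_r(c^{j,1}) + \sum_{j \in \mathbb{Z}_{n-r-1}} a_{j+r+1}^0 B_r(c^{j+r+1,0}) = y^r$$

for $a_r^1$ and $c^{r,1}$ until we reach $r = n-1$. In this manner, we construct a sequence
of vectors $a^k \in \mathbb{R}^n$ and $c^{j,k} \in \mathbb{R}^m$, $k \in \mathbb{Z}_+$, $j \in \mathbb{Z}_n$ such that for all $k \in \mathbb{Z}_+$ and
$r \in \mathbb{Z}_n$,

$$\sum_{j \in \mathbb{Z}_{r+1}} a_j^{k+1} B_r(c^{j,k+1}) + \sum_{j \in \mathbb{Z}_{n-r-1}} a_{j+r+1}^{k} B_r(c^{j+r+1,k}) = y^r. \qquad (4.30)$$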

We do not know whether or not this iterative method converges in the gen- 
erality presented. However, below we provide a sufficient condition for which the 
sequences generated above remain bounded. 

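Corollary 2. If there is an $s \in \mathbb{Z}_n$ such that whenever $\{c^j : j \in \mathbb{Z}_n\} \subset \mathbb{R}^m$,
$b = (b_j : j \in \mathbb{Z}_n) \in \mathbb{R}^n$ with $c^j \cdot c^j > 0$, $j \in \mathbb{Z}_n$ and

$$\sum_{j \in \mathbb{Z}_n} b_j B_s(c^j) = 0 \qquad (4.31)$$

it follows that $b = 0$, then the sequence $\{a_j : j \in \mathbb{Z}_n\}$ defined in (4.28) is bounded.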



Proof. Without loss of generality we assume, by reordering the equations, that
$s = n-1$. The last equation in (4.30), corresponding to $r = n-1$, allows us
to observe that the coefficients $\{a_j^{k+1} : j \in \mathbb{Z}_n\}$ remain bounded during the
updating procedure. To confirm this, we set $\gamma^k = \sum_{j \in \mathbb{Z}_n} |a_j^{k+1}|$ and divide both
sides of (4.30) by $\gamma^k$. If the sequence $\{\gamma^k : k \in \mathbb{N}\}$ is not bounded we obtain,
in the limit as $k \to \infty$ through a subsequence, that

$$\sum_{j \in \mathbb{Z}_n} \tilde a_j B_s(c^j) = 0 \qquad (4.32)$$

where the constants $\tilde a_j$, $j \in \mathbb{Z}_n$ satisfy $\sum_{j \in \mathbb{Z}_n} |\tilde a_j| = 1$, which is in contradiction
with our hypothesis. □

5 Discussion 

We have proposed a framework for learning in a Banach space and established
a representation theorem for the solution of regularization-based learning
algorithms. This naturally extends the representation theorem in Hilbert spaces
which is central in developing kernel-based methods. The framework builds on 
a link between regularization and minimal norm interpolation, a key concept in 
function estimation and interpolation. For concrete Banach spaces such as Orlicz 
spaces, our result leads to the functional representation (1.2). We have studied 
the density property of this functional representation and its extension. 

There are important directions that should be explored in the context
presented in this paper. First, it would be valuable to extend on-line and batch
learning algorithms which have already been studied for finite dimensional
Banach spaces (see e.g. [1,9,26]) within the general framework discussed here.
For example, in [14] we consider the hinge loss function used in support vector
machines and an appropriate H to identify the dual of the minimization problem
(1.3) and report on our numerical experience with it.

Second, it would be interesting to study error bounds for learning in Banach
spaces. This study will involve both the sample error and the approximation error,
and should uncover advantages or disadvantages of learning in Banach spaces in
comparison to Hilbert spaces which are not yet understood.

Finally, we believe that the framework presented here remains valid when 
problems (2.5) and (2.7) are studied subject to additional convex constraints. 
These may be available in form of prior knowledge on the function we seek to 
learn. Indeed constrained minimal norm interpolation has been studied in Hilbert 
spaces, see [15] and [5] for a review. It would be interesting to extend these ideas
to regularization in Banach spaces. As an example, consider the problem of
learning a nonnegative function $f$ in the Hilbert space $\mathcal{H} := L^2(D)$ from the
data $\{y_j = \int_D f(t) g_j(t)\,dt : j \in \mathbb{Z}_m\}$. Then, any minimizer of the regularization
functional of the form (1.3) in $\mathcal{H}$ (where $L_j(f) := \int_D f(t) g_j(t)\,dt$) subject to
the additional nonnegativity constraint has the form in equation (1.2) where
$\phi(t) = \max(t, 0)$, $t \in \mathbb{R}$, see Theorem 2.3 in [15] for a proof.
Abstract. We present sharp bounds on the risk of the empirical minimization
algorithm under mild assumptions on the class. We introduce
the notion of isomorphic coordinate projections and show that this leads
to a sharper error bound than the best previously known. The quantity
which governs this bound on the empirical minimizer is the largest fixed
point of the function $\xi_n(r) = \mathbb{E}\sup\{|\mathbb{E}f - \mathbb{E}_n f| : f \in F, \mathbb{E}f = r\}$. We
prove that this is the best estimate one can obtain using "structural
results", and that it is possible to estimate the error rate from data. We
then prove that the bound on the empirical minimization algorithm can
be improved further by a direct analysis, and that the correct error rate
is the maximizer of $\xi'_n(r) - r$, where $\xi'_n(r) = \mathbb{E}\sup\{\mathbb{E}f - \mathbb{E}_n f : f \in F,
\mathbb{E}f = r\}$.

Keywords: Statistical learning theory, empirical risk minimization, gen- 
eralization bounds, concentration inequalities, isomorphic coordinate 
projections, data-dependent complexity. 



1 Introduction 

Error bounds for learning algorithms measure the probability that a function 
produced by the algorithm has a small error. Sharp bounds give an insight into 
the parameters that are important for learning and allow one to assess accu- 
rately the performance of learning algorithms. The bounds are usually derived 
by studying the relationship between the expected and the empirical error. It 
is now a standard result that, for every function, the deviation of the expected 
from the empirical error is bounded by a complexity term which measures the 
size of the function class from which the function was chosen. Complexity terms 
which measure the size of the entire class are called global complexity measures, 
and two such examples are the VC-dimension and the Rademacher averages 
of the function class (note that there is a key difference between the two; the 
VC-dimension is independent of the underlying measure, and thus captures the 
worst case scenario, while the Rademacher averages are measure dependent and
lead to sharper bounds). 

Moreover, estimates which are based on comparing the empirical and the 
actual structures (for example empirical vs. actual means) uniformly over the 
class are loose, because this condition is stronger than necessary. Indeed, in 
the case of the empirical risk minimization algorithm, it is more likely that the 
algorithm produces functions with a small expectation, and thus one only has to 
consider a small subclass. Taking that into account, error bounds should depend 
only on the complexity of the functions with small error or variance. Such bounds 
in terms of local complexity measures were established in [10,15,13,2,9]. 

In this article we will show that by imposing very mild structural assumptions 
on the class, these local complexity bounds can be improved further. We will state 
the best possible estimates which can be obtained by a comparison of empirical 
and actual structures. Then, we will pursue the idea of leaving the “structural 
approach” and analyzing the empirical minimization algorithm directly. The 
reason for this is that structural results comparing the empirical and actual 
structures on the class have a limitation. It turns out that if one is too close to 
the true minimizer the class is too rich at that scale and the structures are not 
close at a small enough scale to yield a useful bound. On the other hand, with 
the empirical minimizer one can go beyond the structural limit. 

We consider the following setting and notation: let $\mathcal{X} \times \mathcal{Y}$ be a measurable
space, and let $P$ be an unknown probability distribution on $\mathcal{X} \times \mathcal{Y}$. Let
$((X_1, Y_1), \ldots, (X_n, Y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$ be a finite training sample, where each pair
$(X_i, Y_i)$ is generated independently according to $P$. The goal of a learning
algorithm is to estimate a function $h : \mathcal{X} \to \mathcal{Y}$ (based on the sample), which
predicts the value of $Y$ given $X$. The possible choices of functions are all in a
function class $H$, called the hypothesis class. A quantitative measure of how
accurately a function $h \in H$ approximates $Y$ is given by a loss function $\ell : \mathcal{Y}^2 \to \mathbb{R}$.
Typical examples of loss functions are the 0-1 loss for classification defined by
$\ell(r, s) = 0$ if $r = s$ and $\ell(r, s) = 1$ if $r \ne s$, or the square-loss for regression tasks
$\ell(r, s) = (r - s)^2$. In what follows we will assume a bounded loss function and
therefore, without loss of generality, $\ell : \mathcal{Y}^2 \to [-b, b]$. For every $h \in H$ we define
the associated loss function $\ell_h : (\mathcal{X} \times \mathcal{Y}) \to [-b, b]$, $\ell_h(x, y) = \ell(h(x), y)$, and
denote by $F = \{\ell_h : (\mathcal{X} \times \mathcal{Y}) \to [-b, b] : h \in H\}$ the loss class associated with the
learning problem. The best estimate $h^* \in H$ is the one for which the expected
loss (also called risk) is as small as possible, that is, $\mathbb{E}\ell_{h^*} = \inf_{h \in H} \mathbb{E}\ell_h$, and we
will assume that such an $h^*$ exists and is unique. We call $F' = \{\ell_h - \ell_{h^*} : h \in H\}$
the excess loss class. Note that all functions in $F'$ have a non-negative
expectation, though they can take negative values, and that $0 \in F'$.

Empirical risk minimization algorithms are based on the philosophy that it is
possible to approximate the expectation of the loss functions using their
empirical mean, and choose instead of $h^*$ the function $\hat h \in H$ for which
$\frac{1}{n}\sum_{i=1}^n \ell_{\hat h}(X_i, Y_i) \approx \inf_{h \in H} \frac{1}{n}\sum_{i=1}^n \ell_h(X_i, Y_i)$. Such a function is called the empirical minimizer.
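As an illustration of the setting above, here is a minimal sketch of empirical risk minimization over a finite hypothesis class with the 0-1 loss. The threshold classifiers, the toy sample, and all function names are assumptions made for this example only; the paper treats general classes.

```python
def zero_one_loss(prediction, label):
    """0-1 loss: 0 if the prediction equals the label, 1 otherwise."""
    return 0.0 if prediction == label else 1.0

def empirical_risk(h, sample, loss=zero_one_loss):
    """Empirical mean of the loss of hypothesis h over the sample."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

def threshold_classifier(t):
    """Hypothetical hypothesis h_t(x) = 1 if x >= t, else 0."""
    return lambda x: 1 if x >= t else 0

def empirical_minimizer(hypotheses, sample):
    """Return the index and empirical risk of a hypothesis minimizing the empirical risk."""
    risks = [empirical_risk(h, sample) for h in hypotheses]
    best = min(range(len(hypotheses)), key=risks.__getitem__)
    return best, risks[best]

# Toy sample of (x, y) pairs and a small finite hypothesis class.
sample = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
thresholds = [0.0, 0.25, 0.5, 0.75]
hypotheses = [threshold_classifier(t) for t in thresholds]
best_index, best_risk = empirical_minimizer(hypotheses, sample)
```

On this sample the threshold 0.5 separates the two labels, so it is selected with empirical risk zero.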

In studying the loss class $F$ we will simplify notation and assume that $F$
consists of bounded, real-valued functions defined on a measurable set $\mathcal{X}$, that
is, instead of $\mathcal{X} \times \mathcal{Y}$ we only write $\mathcal{X}$. Let $X_1, \ldots, X_n$ be independent random
variables distributed according to $P$. For every $f \in F$, we denote by

$$P_n f = \mathbb{E}_n f = \frac{1}{n}\sum_{i=1}^n f(X_i), \quad Pf = \mathbb{E}f, \quad R_n f = \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i),$$

where $\mathbb{E}f$ is the expectation of the random variable $f(X)$ with respect to $P$ and
$\sigma_1, \ldots, \sigma_n$ are independent Rademacher random variables, that is, symmetric,
$\{-1, 1\}$-valued random variables. We further denote

$$\|P - P_n\|_F = \sup_{f \in F} |\mathbb{E}f - \mathbb{E}_n f|, \quad R_n F = \sup_{f \in F} R_n f.$$

The Rademacher averages of the class $F$ are defined as $\mathbb{E}R_n F$, where the
expectation is taken with respect to all random variables $X_i$ and $\sigma_i$. An empirical
version of the Rademacher averages is obtained by conditioning on the sample,

$$\mathbb{E}_\sigma R_n F = \mathbb{E}\Big( \sup_{f \in F} \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i) \,\Big|\, X_1, \ldots, X_n \Big).$$
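For a finite class evaluated on a fixed sample, the empirical Rademacher average can be approximated by averaging over random sign vectors. The sketch below is a plain Monte Carlo estimate under that assumption; the representation of the class as a table of values and the function name are choices made for this illustration.

```python
import random

def empirical_rademacher(values, num_draws=1000, seed=0):
    """Monte Carlo estimate of E_sigma sup_f (1/n) sum_i sigma_i f(X_i).

    values[k][i] holds f_k(X_i) for the k-th function of a finite class
    on a fixed sample X_1, ..., X_n (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(num_draws):
        # Draw independent symmetric {-1, 1}-valued signs.
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, row)) / n for row in values)
    return total / num_draws

# A class containing every sign pattern on two points can match any sigma,
# so the supremum equals 1 for every draw.
all_signs = [[a, b] for a in (-1, 1) for b in (-1, 1)]
rich_class_estimate = empirical_rademacher(all_signs, num_draws=200)

# A class containing only the zero function has Rademacher average 0.
zero_class_estimate = empirical_rademacher([[0, 0, 0]], num_draws=200)
```

The two extreme cases show how the quantity grows with the richness of the class on the sample.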




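Let

$$F_r = \{f \in F : \mathbb{E}f = r\}, \quad F^n_{r_1, r_2} = \{f \in F : r_1 \le \mathbb{E}_n f \le r_2\}.$$

For a given sample, denote by $\hat f$ the corresponding empirical risk minimizer,
that is, a function that satisfies $\mathbb{E}_n \hat f = \min_{f \in F} \mathbb{E}_n f$. If the minimum does not
exist, we denote by $\hat f \in F$ any $\rho$-approximate empirical minimizer, which is a
function satisfying

$$\mathbb{E}_n \hat f \le \inf_{f \in F} \mathbb{E}_n f + \rho,$$

where $\rho \ge 0$. Denote the conditional expectation $\mathbb{E}(\hat f(X) \mid X_1, \ldots, X_n)$ by $\mathbb{E}\hat f$.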

In the following we will show that if the class $F$ is star-shaped and the
variance of every function can be bounded by a reasonable function of its
expectation, then the quantity which governs both the structural behaviour of the
class and the error rate of the empirical minimizer is the function

$$\xi_n(r) = \mathbb{E}\sup_{f \in F_r} |\mathbb{E}f - \mathbb{E}_n f| = \mathbb{E}\|P - P_n\|_{F_r},$$

or minor modifications of $\xi_n(r)$. Observe that this function measures the
expectation of the empirical process $\|P - P_n\|$ indexed by the subset $F_r$. In the
classical result, involving a global complexity measure, the resulting bounds are
given in terms of $\mathbb{E}\|P - P_n\|$ indexed by the whole set $F$, and in [10,15,13,2,9] in
terms of the fixed point of $\mathbb{E}\|P - P_n\|$ indexed by the subsets $\{f \in F : \mathbb{E}f \le r\}$
or $\{f \in F : \mathbb{E}f^2 \le r\}$, which are all larger sets than $F_r$. For an empirical
minimizer, these structural comparisons lead to the estimate that $\mathbb{E}\hat f$ is essentially
bounded by $r^* = \inf\{r : \xi_n(r) \le r/4\}$. This result can be improved further: we
show that the loss of the empirical minimizer is concentrated around the value
$s^* = \arg\max\{\xi'_n(r) - r\}$, where $\xi'_n(r) = \mathbb{E}\sup\{\mathbb{E}f - \mathbb{E}_n f : f \in F_r\}$.
2 Preliminaries

In order to obtain the desired results we will require some minor structural 
assumptions on the class, namely, that F is star-shaped around 0 and satisfies 
a Bernstein condition. 

Definition 1. We say that $F$ is a $(\beta, B)$-Bernstein class with respect to the
probability measure $P$ (where $0 < \beta \le 1$ and $B \ge 1$), if every $f \in F$ satisfies

$$\mathbb{E}f^2 \le B(\mathbb{E}f)^\beta.$$

We say that $F$ has Bernstein type $\beta$ with respect to $P$ if there is some constant
$B$ for which $F$ is a $(\beta, B)$-Bernstein class.

There are many examples of loss classes for which this assumption can be
verified. For example, for nonnegative bounded loss functions, the associated loss
function classes satisfy this property with $\beta = 1$. For convex classes of functions
bounded by 1, the associated excess squared-loss class satisfies this property as
well with $\beta = 1$, a result that was first shown in [12] and improved and extended
in [16,3], e.g. to other power types of excess losses.
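The Bernstein condition can be checked directly when the class is finite and the probability space is finite. The following sketch computes the smallest constant $B$ that works for a given $\beta$; the tabular representation of functions and the helper name are assumptions of this example.

```python
def bernstein_constant(functions, probs, beta):
    """Smallest B with E f^2 <= B (E f)^beta for every f in a finite class.

    functions[k][i] is the value of the k-th function at the i-th atom,
    probs[i] is the probability of that atom (illustrative setup).
    Returns float('inf') if some function with E f <= 0 has E f^2 > 0.
    """
    worst = 0.0
    for f in functions:
        mean = sum(p * v for p, v in zip(probs, f))
        second_moment = sum(p * v * v for p, v in zip(probs, f))
        if mean <= 0:
            if second_moment > 0:
                return float("inf")
            continue
        worst = max(worst, second_moment / mean ** beta)
    return worst

# Nonnegative functions bounded by 1 on a uniform four-point space:
# f^2 <= f pointwise, so beta = 1 works with B = 1.
probs = [0.25, 0.25, 0.25, 0.25]
functions = [[0.0, 0.5, 1.0, 1.0], [0.2, 0.2, 0.2, 0.2], [0.0, 0.0, 0.0, 1.0]]
B = bernstein_constant(functions, probs, beta=1.0)
```

This matches the first example in the text: nonnegative functions bounded by $b$ satisfy $\mathbb{E}f^2 \le b\,\mathbb{E}f$.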

Definition 2. $F$ is called star-shaped around 0 if for every $f \in F$ and $0 \le \alpha \le
1$, $\alpha f \in F$.

We can always make a function class star-shaped by replacing $F$ with $\mathrm{star}(F, 0) =
\{\alpha f : f \in F, 0 \le \alpha \le 1\}$. Although $F \subset \mathrm{star}(F, 0)$, one can show that the
complexity measure $\xi_n(r)$ does not increase too much. For star-shaped classes,
the function $\xi_n(r)/r$ is non-increasing, a property which will allow us to estimate
the largest fixed point of $\xi_n(r)$:

Lemma 1. If $F$ is star-shaped around 0, then for any $0 < r_1 < r_2$,

$$\frac{\xi_n(r_1)}{r_1} \ge \frac{\xi_n(r_2)}{r_2}.$$

In particular, if for some $\alpha$, $\xi_n(r) \ge \alpha r$ then for all $0 < r' \le r$, $\xi_n(r') \ge \alpha r'$.

Proof: Fix $\tau = (X_1, \ldots, X_n)$ and without loss of generality, suppose that
$\sup_{f \in F_{r_2}} |\mathbb{E}f - \mathbb{E}_n f|$ is attained at $f$. Then $f' = \frac{r_1}{r_2} f \in F_{r_1}$ satisfies

$$|\mathbb{E}f' - \mathbb{E}_n f'| = \frac{r_1}{r_2} \sup_{f \in F_{r_2}} |\mathbb{E}f - \mathbb{E}_n f|.$$

Taking expectations with respect to the sample gives the claim. ■

The tools used in the proofs of this article are mostly concentration inequal- 
ities. We first state the main concentration inequality used in this article, which 
is a version of Talagrand’s inequality [21,20,11]. 




274 P.L. Bartlett, S. Mendelson, and P. Philips 



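Theorem 1. Let $F$ be a class of functions defined on $\mathcal{X}$ and set $P$ to be a
probability measure such that for every $f \in F$, $\|f\|_\infty \le b$ and $\mathbb{E}f = 0$. Let
$X_1, \ldots, X_n$ be independent random variables distributed according to $P$ and set
$\sigma^2 = n \sup_{f \in F} \mathrm{var}[f]$. Define

$$Z = \sup_{f \in F} \sum_{i=1}^n f(X_i), \qquad \bar Z = \sup_{f \in F} \Big| \sum_{i=1}^n f(X_i) \Big|.$$

Then there is an absolute constant $K$ such that, for every $x > 0$ and every $\rho > 0$,
the following holds:

$$\Pr\left(\left\{ Z \ge (1 + \rho)\mathbb{E}Z + \sigma\sqrt{Kx} + K(1 + \rho^{-1})bx \right\}\right) \le e^{-x},$$

$$\Pr\left(\left\{ Z \le (1 - \rho)\mathbb{E}Z - \sigma\sqrt{Kx} - K(1 + \rho^{-1})bx \right\}\right) \le e^{-x},$$

and the same inequalities hold for $\bar Z$.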

The inequality for Z is due to Massart [14]. The one sided versions were 
shown by Rio [19] and Klein [7]. For b = 1, the best estimates on the constants 
in all cases are due to Bousquet [6]. 

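Setting $Z = \|P - P_n\|_F$ we obtain the following corollary:

Corollary 1. For any class of functions $F$, and every $x > 0$, if

$$\lambda \ge C \max\left\{ \mathbb{E}\|P - P_n\|_F,\ \sigma_F\sqrt{\frac{x}{n}},\ \frac{bx}{n} \right\}, \qquad (1)$$

where $\sigma_F^2 = \sup_{f \in F} \mathrm{var}[f]$ and $b = \sup_{f \in F} \|f\|_\infty$, then with probability at least
$1 - e^{-x}$, every $f$ in $F$ satisfies

$$|\mathbb{E}f - \mathbb{E}_n f| \le \lambda.$$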

This global estimate is essentially the result obtained in [8,1,18]. It is a
worst-case result in the sense that it holds uniformly over the entire class, but
$\mathbb{E}\|P - P_n\|_F$ is a better measure of complexity than the VC-dimension since
it is measure dependent, and it is well known that for binary valued classes,
$\mathbb{E}\|P - P_n\|_F \le c\sqrt{VC(F)/n}$. One way of understanding this result is as
a method to compare the empirical and actual structure on the class
additively up to $\lambda$. Condition (1) arises from the two extra terms in Talagrand's
concentration inequality. The result is sharp since it can be shown that for
large enough $n$, $\mathbb{E}\|P - P_n\|_F \ge \sigma_F\sqrt{x/n}$, and that with high probability
$\|P - P_n\|_F \ge c\,\mathbb{E}\|P - P_n\|_F$ for a suitable absolute constant $c$, see e.g. [4].
Therefore, asymptotically, the difference of empirical and actual structures in
this sense is controlled by the global quantity $\mathbb{E}\|P - P_n\|_F$, and the error rate
obtained using this approach cannot decay faster than $O(1/\sqrt{n})$. In particular,
for any $\rho$-approximate empirical minimizer, if $r$ satisfies the global condition of
the theorem, then with probability at least $1 - e^{-x}$, $\mathbb{E}\hat f \le \mathbb{E}_n \hat f + \rho + r$.

The following symmetrization theorem states that the expectation of
$\|P - P_n\|_F$ is upper bounded by the Rademacher averages of $F$, see for example
[17].

Theorem 2. Let $F$ be a class of functions defined on $\mathcal{X}$, set $P$ to be a probability
measure on $\mathcal{X}$ and $X_1, \ldots, X_n$ independent random variables distributed according
to $P$. Then,

$$\mathbb{E}\|P - P_n\|_F \le 2\,\mathbb{E}R_n F.$$

The next lemma, following directly from a theorem in [5], shows that the
Rademacher averages of a class can be upper bounded by the empirical
Rademacher averages of this class. The following formulation can be found in [2].

Theorem 3. Let $F$ be a class of bounded functions defined on $\mathcal{X}$ taking values
in $[a, b]$, $P$ a probability measure on $\mathcal{X}$, and $X_1, \ldots, X_n$ be independent random
variables distributed according to $P$. Then, for any $0 < \alpha < 1$ and $x > 0$, with
probability at least $1 - e^{-x}$,

$$\mathbb{E}R_n F \le \frac{1}{1 - \alpha}\,\mathbb{E}_\sigma R_n F + \frac{(b - a)x}{4n\alpha(1 - \alpha)}.$$



3 Isomorphic Coordinate Projections 



We now introduce a multiplicative (rather than additive, as in Corollary 1)
notion of similarity of the expected and empirical means which characterizes the
fact that, for the given sample, for all functions in the class, $|\mathbb{E}f - \mathbb{E}_n f|$ is at
most a constant times its expectation.

Definition 3. For $\tau = (X_1, \ldots, X_n)$, we say that the coordinate projection $\Pi_\tau :
f \mapsto (f(X_1), \ldots, f(X_n))$ is an $\epsilon$-isomorphism if for every $f \in F$,

$$(1 - \epsilon)\mathbb{E}f \le \mathbb{E}_n f \le (1 + \epsilon)\mathbb{E}f.$$

We observe that for star-shaped classes, if, for a given sample $\tau$, a coordinate
projection $\Pi_\tau$ is an $\epsilon$-isomorphism on the subset $F_r$, then the same holds for the
larger set $\{f \in F : \mathbb{E}f \ge r\}$.

Lemma 2. Let $F$ be star-shaped around 0 and let $\tau \in \mathcal{X}^n$. For any $r > 0$ and
$0 < \epsilon < 1$, the projection $\Pi_\tau$ is an $\epsilon$-isomorphism of $F_r$ if and only if it is an
$\epsilon$-isomorphism of $\{f \in F : \mathbb{E}f \ge r\}$.
Proof: Let $f \in F$ be such that $\mathbb{E}f = t > r$. Since $F$ is star-shaped around
0, $g = rf/t \in F_r$; hence, $(1 - \epsilon)\mathbb{E}f \le \mathbb{E}_n f \le (1 + \epsilon)\mathbb{E}f$ if and only if the same
holds for $g$. ■
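The $\epsilon$-isomorphism condition and the scaling argument of Lemma 2 can be illustrated for a finite class whose expectations and empirical means are known. This is a small sketch under that assumption; the lists of numbers are hypothetical values, not data from the paper.

```python
def is_eps_isomorphism(expectations, empirical_means, eps):
    """Check (1 - eps) E f <= E_n f <= (1 + eps) E f for every function in a finite class."""
    return all(
        (1 - eps) * e <= m <= (1 + eps) * e
        for e, m in zip(expectations, empirical_means)
    )

# Hypothetical expectations E f and empirical means E_n f of two functions.
expectations = [0.5, 0.2]
empirical_means = [0.52, 0.19]

holds_loose = is_eps_isomorphism(expectations, empirical_means, eps=0.1)
holds_tight = is_eps_isomorphism(expectations, empirical_means, eps=0.01)

# Rescaling each function by a constant (as in g = r f / t) multiplies both
# its expectation and its empirical mean by the same factor, so the
# multiplicative condition is unaffected.
scale = 0.5
holds_scaled = is_eps_isomorphism(
    [scale * e for e in expectations],
    [scale * m for m in empirical_means],
    eps=0.1,
)
```

The invariance under rescaling is exactly why it suffices to study the condition on the level sets $F_r$ for star-shaped classes.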

Thus, for star-shaped classes, it suffices to analyze this notion of similarity
on the subsets $F_r$. The next result, which establishes this fact, follows from
Theorem 1. It states that for every subset $F_r$, if $\xi_n(r)$ is slightly smaller than
$r$ then most projections are $\epsilon$-isomorphisms on $F_r$ (and by Lemma 2 also on
$\{f \in F : \mathbb{E}f \ge r\}$). On the other hand, if $\xi_n(r)$ is slightly larger than $r$, most
projections are not $\epsilon$-isomorphisms. Hence, at the value of $r$ for which $\xi_n(r) \sim r$,
there occurs a phase transition: above that point the class is small enough and a
structural result can be obtained. Below the point, the class $F_r$, which consists
of scaled down versions of all functions $\{f \in F : \mathbb{E}f > r\}$ and "new atoms" with
$\mathbb{E}f = r$, is too saturated and statistical control becomes impossible.



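Theorem 4. There is an absolute constant $c$ for which the following holds. Let
$F$ be a class of functions, such that for every $f \in F$, $\|f\|_\infty \le b$. Assume that $F$
is a $(\beta, B)$-Bernstein class. Suppose $r \ge 0$, $0 < \epsilon < 1$, and $0 < \alpha < 1$ satisfy

$$r \ge c \max\left\{ \frac{bx}{n\alpha^2\epsilon},\ \left(\frac{Bx}{n\alpha^2\epsilon^2}\right)^{1/(2-\beta)} \right\}.$$

1. If $\mathbb{E}\|P - P_n\|_{F_r} \ge (1 + \alpha)r\epsilon$, then

$$\Pr\{\Pi_\tau \text{ is not an } \epsilon\text{-isomorphism of } F_r\} \ge 1 - e^{-x}.$$

2. If $\mathbb{E}\|P - P_n\|_{F_r} \le (1 - \alpha)r\epsilon$, then

$$\Pr\{\Pi_\tau \text{ is an } \epsilon\text{-isomorphism of } F_r\} \ge 1 - e^{-x}.$$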



Proof: The proof follows in a straightforward way from Theorem 1. Define $Z =
n\|P - P_n\|_{F_r}$, set $\sigma^2 = n \sup_{f \in F_r} \mathrm{var}[f]$ and note that $\Pi_\tau$ is an $\epsilon$-isomorphism
of $F_r$ if and only if $Z \le \epsilon r n$.

To prove the first part of our claim, recall that by Theorem 1, for every
$\rho, x > 0$, with probability larger than $1 - e^{-x}$,

$$Z > (1 - \rho)\mathbb{E}Z - \sigma\sqrt{Kx} - K\left(1 + \frac{1}{\rho}\right)bx.$$

To ensure that $Z > \epsilon r n$, select $\rho = \alpha/(2(1 + \alpha))$, and observe that by the
assumption that $F$ is a Bernstein class, it suffices to show that

$$\frac{1}{2}\alpha n r \epsilon \ge (B n r^\beta K x)^{1/2} + K\left(1 + \frac{2(1 + \alpha)}{\alpha}\right)bx,$$

which holds by the condition on $r$.

The second part of the claim also follows from Theorem 1: for every $\rho, x > 0$,
with probability larger than $1 - e^{-x}$,

$$Z < (1 + \rho)\mathbb{E}Z + \sigma\sqrt{Kx} + K\left(1 + \frac{1}{\rho}\right)bx.$$

Choosing $\rho = \alpha/(2(1 - \alpha))$, we see that $Z < n r \epsilon$ if

$$\frac{1}{2}\alpha n r \epsilon \ge (B n r^\beta K x)^{1/2} + K\left(1 + \frac{2(1 - \alpha)}{\alpha}\right)bx,$$

so the condition on $r$ again suffices. ■




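Corollary 2. Let $F$ be a class of functions bounded by $b$, which is star-shaped
around 0 and is a $(\beta, B)$-Bernstein class. Then there exists an absolute constant
$c$ for which the following holds. If $0 < \epsilon, \alpha < 1$, and $r, x > 0$, satisfy

$$r \ge \max\left\{ \frac{\xi_n(r)}{(1 - \alpha)\epsilon},\ \frac{cbx}{n\alpha^2\epsilon},\ c\left(\frac{Bx}{n\alpha^2\epsilon^2}\right)^{1/(2-\beta)} \right\} \qquad (2)$$

then with probability at least $1 - e^{-x}$, every $f \in F$ satisfies

$$\mathbb{E}f \le \max\left\{ \frac{\mathbb{E}_n f}{1 - \epsilon},\ r \right\}.$$

Proof: The proof follows directly from Theorem 4. ■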

Clearly, Corollary 2 is an improvement on the result in Corollary 1 for most
interesting loss classes, for which $0 < \beta \le 1$. The condition (2) allows one to
control the expectation of the empirical minimizer asymptotically up to the scale
$O(1/n^{1/(2-\beta)})$, and for classes with $\beta = 1$ even at the best possible scale $O(1/n)$,
as opposed to $O(1/\sqrt{n})$ in Corollary 1. The quantity $\xi_n(r) = \mathbb{E}\|P - P_n\|_{F_r}$ is
also an improvement on $\lambda \sim \mathbb{E}\|P - P_n\|_F$ from Corollary 1, since the supremum
is taken only on the subset $F_r$ which can be much smaller than $F$.

Corollary 2 also improves the localized results from [2]. In [2] the indexing set
is the set of functions with a small variance, $\{f \in F : Pf^2 \le r\}$, or a sub-root
function upper bounding the empirical process indexed by $\{f \in F : Pf \le r\}$.
The advantage of Corollary 2 is that the indexing set $F_r$ is smaller, and that the
upper bound in terms of the fixed point can be proved without assuming the
sub-root property. The property of $\xi_n(r)$ in Lemma 1, a "sub-linear" property,
is sufficient to lead to the following estimate on the empirical minimizer:



Theorem 5. Let $F$ be a $(\beta, B)$-Bernstein class of functions bounded by $b$ which
is star-shaped around 0. Then there is an absolute constant $c$ such that if

$$r' = \max\left\{ \inf\left\{r : \xi_n(r) \le \frac{r}{4}\right\},\ \frac{cbx}{n},\ c\left(\frac{Bx}{n}\right)^{1/(2-\beta)} \right\},$$

then with probability at least $1 - e^{-x}$, a $\rho$-approximate empirical minimizer $\hat f \in F$
satisfies

$$\mathbb{E}\hat f \le \max\{2\rho, r'\}.$$
Proof: The proof follows from Corollary 2 by taking $\epsilon = \alpha = 1/2$ and $r = r'$. In
particular, Lemma 1 shows that if $r' \ge \inf\{r : \xi_n(r) \le r/4\}$, then $\xi_n(r') \le r'/4$.
Thus, with large probability, if $f \in F$ satisfies $\mathbb{E}f \ge r'$, then $\mathbb{E}f \le 2\mathbb{E}_n f$. Since
$\hat f$ is a $\rho$-approximate empirical minimizer and $F$ is star-shaped at 0, it follows
that $\mathbb{E}_n \hat f \le \rho$, so either $\mathbb{E}\hat f \le r'$ or $\mathbb{E}\hat f \le 2\rho$, as claimed. ■

Thus, with high probability, $r^* = \inf\{r : \xi_n(r) \le r/4\}$ is an upper bound for
$\mathbb{E}\hat f$, as long as $r^* \ge c/n$.

This result holds in particular for any empirical minimizer of the excess loss
class if the true minimizer $f^*$ exists. In this case, $0 \in F$, and any empirical
minimizer over $F$ is also an empirical minimizer over $\mathrm{star}(F, 0)$.

Data-Dependent Estimation of $\xi_n(r)$ and $r^*$

The next question we wish to address is how to estimate the function $\xi_n(r)$ and
the fixed point

$$r^* = \inf\left\{r : \xi_n(r) \le \frac{r}{4}\right\}$$

empirically, in cases where the global complexity of the function class, for
example the covering numbers or the combinatorial dimension, is not known.

To estimate $r^*$ we will find an empirically computable function $\hat\xi_n(r)$ which
is, with high probability, an upper bound for the function $\xi_n(r)$. Therefore, it
will hold that its fixed point $\hat r^* = \inf\{r : \hat\xi_n(r) \le r/4\}$ is with high probability an
upper bound for $r^*$. Since $\hat\xi_n(r)/r$ will be a non-increasing function, we will be
able to determine $\hat r^*$ using a binary search algorithm.

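Assume that $F$ is a star-shaped $(\beta, B)$-Bernstein class and $\sup_{f \in F} \|f\|_\infty \le
b$. Let $\tau = (X_1, \ldots, X_n)$ be a sample, where each $X_i$ is drawn independently
according to $P$.

From Theorem 4, for $\alpha = 1/2$, $\epsilon = 1/2$, if $r \ge c\max\left\{\frac{bx}{n}, \left(\frac{Bx}{n}\right)^{1/(2-\beta)}\right\}$ and
$\xi_n(r) \le r/4$, then with probability larger than $1 - e^{-x}$, every $f \in F_r$ satisfies

$$\mathbb{E}_n f \in \left[\frac{r}{2}, \frac{3r}{2}\right].$$

Since $F$ is star-shaped, and by Lemma 1, it holds that $\xi_n(r) \le r/4$ if and only
if $r \ge r^*$. Therefore, if $r \ge \max\left\{r^*, \frac{cbx}{n}, c\left(\frac{Bx}{n}\right)^{1/(2-\beta)}\right\}$, then with probability
larger than $1 - e^{-x}$, $F_r \subset F^n_{r/2,\,3r/2}$, which implies that

$$\mathbb{E}_\sigma R_n(F_r) \le \mathbb{E}_\sigma R_n\left(F^n_{r/2,\,3r/2}\right),$$

where $F^n_{r_1, r_2} = \{f \in F : r_1 \le \mathbb{E}_n f \le r_2\}$.

By symmetrization (Theorem 2) and concentration of Rademacher averages
around their mean (Theorem 3), it follows that with probability at least $1 - 2e^{-x}$,

$$\xi_n(r) \le 2\,\mathbb{E}R_n(F_r) \le 4\,\mathbb{E}_\sigma R_n(F_r) + \frac{bx}{n} \le 4\,\mathbb{E}_\sigma R_n\left(F^n_{r/2,\,3r/2}\right) + \frac{r}{c},$$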
where we used the fact that $r \ge cbx/n$ (and clearly we can assume that $c > 8$).
Set

$$r' = \max\left\{ r^*,\ \frac{cbx}{n},\ c\left(\frac{Bx}{n}\right)^{1/(2-\beta)} \right\}$$

and



R = 



1 2 



TH 



n 



[r'nj \bri\ 



n n 

Applying the union bound, and since |i?| < 6n + 1, with probability at least 
1 — 2{hn + l)e“®, inij) < 4Ecri?n 3 r ^ j for every r G i?. By Lemma 1, if 

r G [k/n, (fc + l)/n], then ^„(r) < (^) and thus, with probability at least 

1 — 2{hn + l)e“^, every r G [r', b] satisfies 

where ci, C 2 are positive constants. We define therefore 

= 8Eo-i?„ (A'cjr,C2r) A “■ 

Then it follows that with probability at least $1 - 2(bn + 1)e^{-x}$,

$$\forall r \in [r', b] : \ \xi_n(r) \le \hat\xi_n(r).$$

Let $\hat r^* = \inf\{r : \hat\xi_n(r) \le r/4\}$; then we know that with probability at least
$1 - 2(bn + 1)e^{-x}$, $\hat r^* \ge r^*$. Since $\hat\xi_n(r)/r$ is non-increasing, it follows that $r \ge \hat r^*$
if and only if $\hat\xi_n(r) \le r/4$.

With this, given a sample of size $n$, we are ready to state the following
algorithm to estimate the upper bound $\hat r^*$ based on the data:

Algorithm RSTAR($F$, $X_1, \ldots, X_n$)

Set $r_L = 0$, $r_R = b$.
if $\hat\xi_n(r_R) \le r_R/4$ then
for $l = 0$ to $\lceil \log_2 bn \rceil$
set $r = (r_L + r_R)/2$;
if $\hat\xi_n(r) > r/4$ then set $r_L = r$,
else set $r_R = r$.
Output $\hat r = r_R$.

By the construction, $\hat r - \frac{1}{n} \le \hat r^* \le \hat r$. For every $n$ and every sample, with
probability larger than $1 - 2(bn + 1)e^{-x}$, $r^* \le \hat r$.
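The binary search can be sketched in a few lines. In the sketch below the data-dependent bound $\hat\xi_n$ is passed in as a callable `xi_hat`, which is an assumption of this illustration (computing it from a sample is a separate task), and the midpoint update is the standard bisection step.

```python
import math

def rstar(xi_hat, b, n):
    """Bisection for the fixed point inf{r : xi_hat(r) <= r / 4} on [0, b].

    xi_hat is assumed to be a callable with xi_hat(r) / r non-increasing;
    b is the bound on the class and n the sample size.
    """
    r_left, r_right = 0.0, float(b)
    if xi_hat(r_right) <= r_right / 4:
        # l = 0, ..., ceil(log2(b n)): enough halvings to reach precision 1/n.
        for _ in range(math.ceil(math.log2(b * n)) + 1):
            r = (r_left + r_right) / 2
            if xi_hat(r) > r / 4:
                r_left = r
            else:
                r_right = r
    return r_right

# Hypothetical complexity function with xi_hat(r) / r non-increasing.
# Its fixed point solves 0.1 * sqrt(r) = r / 4, that is, r = 0.16.
r_hat = rstar(lambda r: 0.1 * math.sqrt(r), b=1.0, n=100)

# If the test fails already at r = b, the search is skipped and b is returned.
r_trivial = rstar(lambda r: 1.0, b=1.0, n=100)
```

The returned value overshoots the fixed point by at most $1/n$, in line with the guarantee stated after the algorithm.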

Theorem 6. Let $F$ be a $(\beta, B)$-Bernstein class of functions bounded by $b$ which
is star-shaped around 0. With probability at least $1 - (2bn + 3)e^{-x}$, a
$\rho$-approximate empirical minimizer $\hat f \in F$ satisfies

$$\mathbb{E}\hat f \le \max\{2\rho, r''\},$$

where

$$r'' = \max\left\{ \hat r,\ \frac{cbx}{n},\ c\left(\frac{Bx}{n}\right)^{1/(2-\beta)} \right\}$$

and $\hat r = \mathrm{RSTAR}(F, \tau)$.



RSTAR($F$, $\tau$) is essentially the fixed point of the function $r \mapsto \mathbb{E}_\sigma R_n(F^n_{c_1 r, c_2 r})$.
This function measures the complexity of the function class $F^n_{c_1 r, c_2 r}$ which is the subset of
functions having the empirical mean in an interval whose length is proportional
to $r$. The main difference from the data-dependent estimates in [2] is that instead
of taking the whole empirical ball, here we only measure the complexity of an
empirical "belt" around $r$, since $c_1 r > 0$.

We can tighten this bound further by narrowing the size of the belt by
replacing the empirical set $F^n_{r/2,\,3r/2}$ with $F^n_{r - r/\log n,\, r + r/\log n}$. The price we pay is
an extra $\log n$ factor.

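With the same reasoning as before, by Theorem 4 for $\alpha = 1/2$, $\epsilon = 1/\log n$,
and since $F$ is star-shaped, then, if $r \ge \max\left\{r^*, \frac{cbx\log n}{n}, c\left(\frac{Bx\log^2 n}{n}\right)^{1/(2-\beta)}\right\}$,
with probability larger than $1 - e^{-x}$, $F_r \subset F^n_{r - r/\log n,\, r + r/\log n}$. We define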



in{r) ^4Ecri?n (T’^/n-fc/(nlogn),fe/n-|-fc/(nlogn)) + g^lognj k 



n 



if $r \in [k/n, (k+1)/n]$. Again, with probability at least $1 - 2(bn + 1)e^{-x}$ it holds
that for all $r \in [r', b]$: $\xi_n(r) \le \hat\xi_n(r)$, where

$$r' = \max\left\{ r^*,\ \frac{cbx\log n}{n},\ c\left(\frac{Bx\log^2 n}{n}\right)^{1/(2-\beta)} \right\}.$$



Since $\hat\xi_n(r)/r$ is non-increasing, we can compute

$$\hat r^* = \inf\left\{ r : \hat\xi_n(r) \le \frac{r}{2\log n} \right\}$$

with a slight modification of RSTAR (we replace the test in the if-clause,
$\hat\xi_n(r) > r/4$, with $\hat\xi_n(r) > r/(2\log n)$). For every $n$ and every sample of size
$n$, with probability larger than $1 - 2(bn + 1)e^{-x}$, $r^* \le \hat r$.



4 Direct Concentration Result for Empirical Minimizers

In this section we show that a direct analysis of the empirical minimizer
leads to sharper estimates than those obtained in the previous section. We will
show that $\mathbb{E}\hat f$ is concentrated around the value $s^* = \arg\max\{\xi'_n(r) - r\}$, where

$$\xi'_n(r) = \mathbb{E}\sup\{\mathbb{E}f - \mathbb{E}_n f : f \in F, \mathbb{E}f = r\}.$$
To understand why it makes sense to expect that with high probability $\mathbb{E}\hat f \sim
s^*$, fix one value of $r$ such that $\xi'_n(s^*) - s^* > \xi'_n(r) - r$. Consider a perfect
situation in which one could say that with high probability,

$$\xi'_n(r) \sim \sup\{\mathbb{E}f - \mathbb{E}_n f : f \in F, \mathbb{E}f = r\} = r - \inf\{\mathbb{E}_n f : f \in F, \mathbb{E}f = r\}.$$

(Of course, this is not the case, as Talagrand's inequality contains additional
terms which blow up as the multiplicative constant represented by $1 + \rho$ tends to
one; this fact is the crux of the proof.) In that case, it would follow that

$$-\inf\{\mathbb{E}_n f : f \in F, \mathbb{E}f = s^*\} > -\inf\{\mathbb{E}_n f : f \in F, \mathbb{E}f = r\}$$

and the empirical minimizer will not be in $F_r$. In a similar manner, one has to
rule out all other values of $r$, and to that end we will have to consider a belt
around $s^*$ rather than $s^*$ itself.

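For $\varepsilon > 0$, define

$$r_{\varepsilon,+} = \sup\Big\{0 \le r \le b : \xi_n(r) - r \ge \sup_s(\xi_n(s) - s) - \varepsilon\Big\},$$
$$r_{\varepsilon,-} = \inf\Big\{0 \le r \le b : \xi_n(r) - r \ge \sup_s(\xi_n(s) - s) - \varepsilon\Big\}.$$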



The following theorem is the main result: 



Theorem 7. For any $c_1 > 0$, there is a constant $c$ (depending only on $c_1$) such that the following holds. Let $F$ be a $(\beta, B)$-Bernstein class that is star-shaped at 0. Define $r_{\varepsilon,+}$ and $r_{\varepsilon,-}$ as above, and set

$$r' = \max\left\{ \inf\{r : \xi_n(r) \le r/4\},\ \frac{cb(x + \log n)}{n},\ c\left(\frac{B(x + \log n)}{n}\right)^{1/(2-\beta)} \right\}.$$

For $0 < \rho < r'/2$, let $\hat f$ denote a $\rho$-approximate empirical risk minimizer. If

$$\varepsilon \ge c\left(\max\Big\{\sup_s(\xi_n(s) - s),\ r'^{\beta}\Big\}\,\frac{(B+b)(x + \log n)}{n}\right)^{1/2} + \rho,$$



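then

1. With probability at least $1 - e^{-x}$,

$$\mathbb{E}\hat f \le \max\left\{\frac{1}{n},\ r_{\varepsilon,+}\right\}.$$

2. If

$$\xi_n(0, c_1/n) < \sup_s(\xi_n(s) - s) - \varepsilon,$$

then with probability at least $1 - e^{-x}$,

$$\mathbb{E}\hat f \ge r_{\varepsilon,-}.$$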
Note that this result is considerably sharper than the bound resulting from Theorem 5, as long as the function $\xi_n(r) - r$ is not flat. (This corresponds to no "significant atoms" appearing at a scale below some $r_0$, and thus, for $r < r_0$, $F_r$ is just a scaled down version of $F_{r_0}$; if $\xi_n(r) - r$ is flat, the two bounds will be of the same order of magnitude.)

Indeed, by Lemma 1, since $\xi_n(r)/r$ is non-increasing,

$$\inf\{r : \xi_n(r) \le r\} \le \inf\{r : \xi_n(r) \le r/4\}.$$

Clearly, $\xi_n(r) \ge 0$, since $\xi_n(r) \ge \mathbb{E}(\mathbb{E}f - \mathbb{E}_n f) = 0$ for any fixed function, and thus $0 \le s^* \le \inf\{r : \xi_n(r) \le r\} \le r^*$. The same argument shows that if

— r is not “flat” then s* <C r. Now, for /3 = 1, e ^ ^nd re__|_, 

will be of the order of s* . 

5 Discussion 

Now, we will give an example which shows that, for any given sample size n, we 
can construct a function class and a probability measure such that the bound on 
the empirical minimizer differs significantly when using r* from Section 3 versus 
s* from Section 4. 

We first prove the existence of two types of function classes, which are both 
bounded and Bernstein. 

Lemma 3. For every positive integer $n$ and all $m \ge 2(n^2 + n)$, the following holds. If $P$ is the uniform probability measure on $\{1, \ldots, m\}$, then for every $1/n \le \lambda \le 1/2$ there exists a function class $G_\lambda$ such that

1. For every $g \in G_\lambda$, $-1 \le g(x) \le 1$, $\mathbb{E}g = \lambda$ and $\mathbb{E}g^2 \le 2\mathbb{E}g$.

2. For every set $\tau \subset \{1, \ldots, m\}$ with $|\tau| \le n$, there is some $g \in G_\lambda$ such that for every $i \in \tau$, $g(i) = -1$.

Also, there exists a function class $H_\lambda$ such that

1. For every $h \in H_\lambda$, $0 \le h(x) \le 1$, $\mathbb{E}h = \lambda$.

2. For every set $\tau \subset \{1, \ldots, m\}$ with $|\tau| \le n$, there is some $h \in H_\lambda$ such that for every $i \in \tau$, $h(i) = 0$.

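Proof: The proof is constructive. Let $J \subset \{1, \ldots, m\}$, $|J| = n$; for every $I \subset J$ define $g = g_{I,J}$ in the following manner. For $i \in I$, set $g(i) = 1$; if $i \in J \setminus I$, set $g(i) = -1$; and for $i \notin J$ put $g(i) = t$, where

$$t = \frac{\lambda m + |J \setminus I| - |I|}{m - n}.$$

Observe that if $m \ge n^2 + 2n$, then $0 < t \le 2\lambda \le 1$ for every $I, J$. By the definition of $t$, $\mathbb{E}g_{I,J} = \lambda$, and

$$\mathbb{E}g^2 = \frac{1}{m}\left(|I| - |J \setminus I| + t^2(m - n) + 2|J \setminus I|\right) \le \mathbb{E}g + \frac{2|J \setminus I|}{m} \le \mathbb{E}g + \frac{2n}{m} \le \mathbb{E}g + \frac{1}{n} \le 2\mathbb{E}g,$$

where the first inequality uses $t^2 \le t$ and the last inequality holds because $\mathbb{E}g = \lambda \ge 1/n$, and $m \ge 2n^2$.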
The second property of $G_\lambda$ is clear by the construction, and the claims regarding $H_\lambda$ can be verified using a similar argument. ■
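The construction in the proof can be checked numerically. The following sketch builds one function $g_{I,J}$ with exact rational arithmetic and evaluates its first two moments under the uniform measure; the helper name `build_g` and the particular choices $n = 4$, $m = 40$, $\lambda = 1/4$, $I = \{1\}$, $J = \{1,2,3,4\}$ are assumptions made only for illustration.

```python
from fractions import Fraction

def build_g(m, I, J, lam):
    """Return g_{I,J} as the list (g(1), ..., g(m)) of exact rationals."""
    I, J = set(I), set(J)
    assert I <= J and all(1 <= i <= m for i in J)
    n = len(J)
    # t is chosen so that the mean of g under the uniform measure is lam.
    t = (lam * m + len(J - I) - len(I)) / Fraction(m - n)
    return [Fraction(1) if i in I else Fraction(-1) if i in J else t
            for i in range(1, m + 1)]

n = 4
m = 2 * (n * n + n)          # m = 40, the smallest m allowed by the lemma
lam = Fraction(1, n)         # smallest admissible value of lambda
g = build_g(m, I={1}, J={1, 2, 3, 4}, lam=lam)

mean = sum(g) / m                             # E g
second_moment = sum(v * v for v in g) / m     # E g^2
t_value = g[-1]                               # value of g outside J
```

Here `mean` equals $\lambda$ exactly and `second_moment` stays below `2 * mean`, in line with the first property of $G_\lambda$; taking $I = \emptyset$ gives a function equal to $-1$ on all of $J$, which is the second property.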

Given a sample size $n$, we can choose a large enough $m$ and the uniform probability measure $P$ on $\{1, \ldots, m\}$, and define the function class $F = \mathrm{star}(\tilde F, 0)$, where $\tilde F = H_{1/4} \cup G_{1/n}$ from Lemma 3. $F$ is star-shaped and (1,2) Bernstein.

Theorem 8. If $0 < \delta < 1$ and $n > N_0(\delta)$, then for any corresponding $F = \mathrm{star}(\tilde F, 0)$ as above, the following holds:

1. For every $X_1, \ldots, X_n$ there is a function $f \in F$ with $\mathbb{E}f = 1/4$ and $\mathbb{E}_n f = 0$.

2. For the class $F$, the function $\xi_n$ satisfies

$$\xi_n(r) = \begin{cases} (n+1)r & \text{if } 0 < r \le 1/n, \\ r & \text{if } 1/n < r \le 1/4, \\ 0 & \text{if } r > 1/4. \end{cases}$$

Thus, $\inf\{r > 0 : \xi_n(r) \le r/4\} = 1/4$.

3. If $\hat f$ is a $\rho$-approximate empirical minimizer, where $0 < \rho < 1/8$, then with probability larger than $1 - \delta$,

n \ V n j n 

The proof can be found in [4]. 
Abstract. We consider the binary classification problem. Given an i.i.d. sample drawn from the distribution of an $\mathcal{X} \times \{0,1\}$-valued random pair, we propose to estimate the so-called Bayes classifier by minimizing the sum of the empirical classification error and a penalty term based on Efron's or i.i.d. weighted bootstrap samples of the data. We obtain exponential inequalities for such bootstrap type penalties, which allow us to derive non-asymptotic properties for the corresponding estimators. In particular, we prove that these estimators achieve the global minimax risk over sets of functions built from Vapnik-Chervonenkis classes. The results obtained generalize those of Koltchinskii [12] and Bartlett, Boucheron and Lugosi [2] for Rademacher penalties, which can thus be seen as special examples of bootstrap type penalties.



1 Introduction 

Let $(X, Y)$ be a random pair with values in a measurable space $\Xi = \mathcal{X} \times \{0, 1\}$. Given $n$ independent copies $(X_1, Y_1), \ldots, (X_n, Y_n)$ of $(X, Y)$, we aim at constructing a classification rule, that is, a function which would give the value of $Y$ from the observation of $X$. More precisely, in statistical terms, we are interested in the estimation of the function $s$ minimizing the classification error $\mathbb{P}[t(X) \ne Y]$ over all the measurable functions $t : \mathcal{X} \to \{0, 1\}$. The function $s$ is called the Bayes classifier and it is also defined by $s(x) = \mathbb{1}_{\mathbb{P}[Y=1 \mid X=x] > 1/2}$.

Given a class $S$ of measurable functions from $\mathcal{X}$ to $\{0, 1\}$, an estimator $\hat s$ of $s$ is determined by minimization of the empirical classification error $\gamma_n(t) = n^{-1}\sum_{i=1}^n \mathbb{1}_{t(X_i) \ne Y_i}$ over all the functions $t$ in $S$. This method was introduced in learning problems by Vapnik and Chervonenkis [25]. However, it poses the problem of the choice of the class $S$. To provide an estimator with classification error close to the optimal one, $S$ has to be large enough so that the error of the best function in $S$ is close to the optimal error, while it has to be small enough so that finding the best candidate in $S$ from the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ is still possible. In other words, one has to choose a class $S$ which achieves the best trade-off between the approximation error and the estimation error.
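To make the minimization of $\gamma_n$ concrete, here is a minimal sketch over a finite class of threshold rules on the real line. The class, the sample values and the helper names (`empirical_error`, `erm`, `threshold`) are hypothetical and chosen only for illustration; the paper itself works with general countable classes.

```python
def empirical_error(t, sample):
    """gamma_n(t): fraction of sample points (x, y) with t(x) != y."""
    return sum(1 for x, y in sample if t(x) != y) / len(sample)

def erm(classifiers, sample):
    """Return a minimizer of the empirical classification error over the class."""
    return min(classifiers, key=lambda t: empirical_error(t, sample))

def threshold(theta):
    """Rule predicting 1 exactly when x > theta."""
    return lambda x: 1 if x > theta else 0

# Hypothetical data and a finite class S of six threshold rules.
sample = [(0.1, 0), (0.3, 0), (0.4, 1), (0.6, 0), (0.8, 1), (0.9, 1)]
S = [threshold(th) for th in (0.0, 0.2, 0.35, 0.5, 0.7, 1.0)]
s_hat = erm(S, sample)
best_error = empirical_error(s_hat, sample)
```

Enlarging `S` can only lower `best_error`, which is the approximation side of the trade-off described above; the estimation side is what the penalties studied in the paper are meant to control.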



J. Shawe-Taylor and Y. Singer (Eds.): COLT 2004, LNAI 3120, pp. 285-299, 2004.
© Springer-Verlag Berlin Heidelberg 2004
One approach proposed to solve this question is the method of Structural Risk Minimization (SRM) initiated by Vapnik [27] and also known as complexity regularization (see [1] for instance). It consists in selecting, among a given collection of sets of functions, the set $S$ minimizing the sum of the empirical classification error of the estimator $\hat s$ and a penalty term taking the complexity of $S$ into account. The quantities generally used to measure the complexity of some class $S$ of functions from $\mathcal{X}$ to $\{0, 1\}$ are the shatter coefficients of the associated class of sets $\mathcal{C} = \{\{x \in \mathcal{X}, t(x) = 1\}, t \in S\}$ given by:

for $k \ge 1$, $S(\mathcal{C}, k) = \max_{x_1, \ldots, x_k \in \mathcal{X}} \big|\{\{x_1, \ldots, x_k\} \cap C,\ C \in \mathcal{C}\}\big|$,

and the Vapnik-Chervonenkis dimension of $\mathcal{C}$ defined as:

$V(\mathcal{C}) = \infty$ if for all $k \ge 1$, $S(\mathcal{C}, k) = 2^k$,

$V(\mathcal{C}) = \sup\{k \ge 1,\ S(\mathcal{C}, k) = 2^k\}$ else.
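On a small finite domain both quantities can be computed by brute force, which may help to fix ideas. The sketch below does this for the class of discrete intervals on $\{1, \ldots, 6\}$; the function names and the example class are assumptions made for illustration, and the enumeration is exponential in general, so it is only meant for toy cases.

```python
from itertools import combinations

def shatter_coefficient(sets, domain, k):
    """S(C, k): max over k-point subsets of the number of distinct traces."""
    best = 0
    for pts in combinations(domain, k):
        traces = {frozenset(pts) & C for C in sets}
        best = max(best, len(traces))
    return best

def vc_dimension(sets, domain):
    """Largest k with S(C, k) = 2^k on a finite domain.

    Returns 0 when no single point is shattered (a convention for this sketch).
    """
    d = 0
    for k in range(1, len(domain) + 1):
        if shatter_coefficient(sets, domain, k) == 2 ** k:
            d = k
    return d

domain = range(1, 7)
# Discrete intervals {a, ..., b} within {1, ..., 6}, plus the empty set.
intervals = [frozenset(range(a, b + 1)) for a in domain for b in domain if a <= b]
intervals.append(frozenset())
```

For intervals, any two points are shattered, but among three points the subset made of the two extreme points alone cannot be cut out, so $S(\mathcal{C}, 3) = 7 < 2^3$ and the VC dimension is 2.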

Considering a collection $\{S_m, m \in \mathbb{N}^*\}$ of classes of functions from $\mathcal{X}$ to $\{0, 1\}$ and setting $\mathcal{C}_m = \{\{x \in \mathcal{X}, t(x) = 1\}, t \in S_m\}$ for all $m$ in $\mathbb{N}^*$, Lugosi and Zeger [17] study the standard penalties of the form

$$\mathrm{pen}(m) = \kappa\sqrt{(\log S(\mathcal{C}_m, n^2) + m)/n},$$

which are approximately $\kappa'\sqrt{(V(\mathcal{C}_m)\log n + m)/n}$. By using an inequality due to Devroye, they prove that if the classes $\mathcal{C}_m$ are Vapnik-Chervonenkis classes (that is, if they have a finite VC-dimension) such that the sequence $(V(\mathcal{C}_m))_{m \in \mathbb{N}^*}$ is strictly increasing, and if the Bayes classifier $s$ belongs to the union of the $S_m$'s, there exists an integer $k$ such that the expected classification error of the rule obtained by SRM with such penalties differs from the optimal error $\mathbb{P}[s(X) \ne Y]$ by a term not larger than a constant times $\sqrt{V(\mathcal{C}_k)\log n/n}$. This upper bound is optimal in a global minimax sense up to a logarithmic factor. Given a class $S$ of functions from $\mathcal{X}$ to $\{0, 1\}$ where $\mathcal{C} = \{\{x \in \mathcal{X}, t(x) = 1\}, t \in S\}$ is a VC-class with VC-dimension $V(\mathcal{C})$, Vapnik and Chervonenkis [26] actually prove that there exist some constants $\kappa_1$ and $\kappa_2$ such that for any classification rule $\hat s$ with classification error $L_{\hat s}$,

$$\sup_{P,\, s \in S} \mathbb{E}\big[L_{\hat s} - \mathbb{P}[s(X) \ne Y]\big] \ge \kappa_1\sqrt{V(\mathcal{C})/n}, \quad \forall n \ge \kappa_2 V(\mathcal{C}).$$

We explain in the next section how the choice of the penalty terms is connected with the calibration of an upper bound for the quantity $\sup_{t \in S}|\gamma_n(t) - \mathbb{P}[t(X) \ne Y]|$. Unfortunately, in addition to the fact that their computation is generally complicated, the penalties based on the shatter coefficients or the VC-dimensions have the disadvantage of being deterministic and of overestimating this quantity for specific data distributions. This remark has led many authors to introduce data-driven penalties (see for example [6], [15], [5]). Inspired by the method of Rademacher symmetrization commonly used in empirical process theory, Koltchinskii [12] and Bartlett, Boucheron and Lugosi [2] independently propose the so-called Rademacher penalties. They prove oracle type inequalities showing that such random penalties provide optimal classification rules in a global minimax sense over sets of functions built from Vapnik-Chervonenkis classes. Lozano [14] gives experimental evidence that, for the intervals model selection problem, Rademacher penalization outperforms SRM and cross validation over a wide range of sample sizes. Bartlett, Boucheron and Lugosi [2] also study Rademacher penalization from a practical point of view by comparing it with other kinds of data-driven methods.

Whereas the methods of Rademacher penalization are now commonly used in statistical learning theory, they are not yet so popular in the applied statistics community. In fact, statisticians often prefer to stick with resampling tools such as the bootstrap or the jackknife in practice. We aim here at making the connection between the two approaches. We investigate a new family of penalties based on classical bootstrap processes such as Efron's or i.i.d. weighted bootstrap ones, while taking care to place Rademacher penalties within this family.

The paper is organized as follows. In Section 2, we present the model selection by penalization approach and explain how to choose a penalty function. We introduce and study in Section 3 some penalties based on Efron's bootstrap samples of the observations. We establish oracle type inequalities and, from a maximal inequality stated in Section 5, some (global) minimax properties for the corresponding classification rules. Section 4 is devoted to various symmetrized bootstrap penalizations: similar results are obtained, generalizing those of Koltchinskii and of Bartlett, Boucheron and Lugosi. We finally give in Section 6 a discussion of these results.

2 Model Selection 

We describe here the model selection by penalization approach to construct classification rules or estimators of the Bayes classifier $s$. In the following, we denote by $\mathcal{S}$ the set of all the measurable functions $t : \mathcal{X} \to \{0, 1\}$ and by $P$ the distribution of $(X, Y)$. Given a countable collection $\{S_m, m \in \mathcal{M}\}$ of classes of functions in $\mathcal{S}$ (the models) and $\rho_n \ge 0$, for any $m$ in $\mathcal{M}$, we can construct some approximate minimum contrast estimator $\hat s_m$ in $S_m$ satisfying:

$$\gamma_n(\hat s_m) \le \inf_{t \in S_m}\gamma_n(t) + \rho_n/2.$$

We thus obtain a collection $\{\hat s_m, m \in \mathcal{M}\}$ of possible classification rules and at this stage, the issue is to choose among this collection the "best" rule in terms of risk minimization. Let $l$ be the loss function defined by:

$$l(u, v) = \mathbb{E}\big[\mathbb{1}_{v(X) \ne Y} - \mathbb{1}_{u(X) \ne Y}\big], \quad \text{for all } u, v \text{ in } \mathcal{S}.$$

Notice that, by definition of $s$, $l(s, t)$ is nonnegative for every $t$ in $\mathcal{S}$. The risk of any estimator $\hat s_m$ of $s$ is given by $\mathbb{E}[l(s, \hat s_m)]$. Ideally, we would like to select some element $\bar m$ (the oracle) in $\mathcal{M}$ minimizing

$$\mathbb{E}[l(s, \hat s_m)] = l(s, s_m) + \mathbb{E}[l(s_m, \hat s_m)],$$

where for every $m$ in $\mathcal{M}$, $s_m$ denotes some function in $S_m$ such that $l(s, s_m) = \inf_{t \in S_m} l(s, t)$. However, such an oracle $\bar m$ necessarily depends on the unknown
distribution of $(X, Y)$. This leads us to use the method of model selection by penalization which originates in Mallows' $C_p$ and Akaike's heuristics.

The purpose of this method is actually to provide a criterion which allows us to select, only from the data, an element $\hat m$ in $\mathcal{M}$ mimicking the oracle. Considering some penalty function $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$, we choose $\hat m$ such that:

$$\gamma_n(\hat s_{\hat m}) + \mathrm{pen}(\hat m) \le \inf_{m \in \mathcal{M}}\{\gamma_n(\hat s_m) + \mathrm{pen}(m)\} + \rho_n/2,$$

and we finally take as "best" rule the so-called approximate minimum penalized contrast estimator

$$\tilde s = \hat s_{\hat m}. \qquad (1)$$
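The selection step itself is a one-line minimization once the empirical errors and the penalties are available. The sketch below shows it for exact minimization (which corresponds to $\rho_n = 0$); the function name `select_model` and the numerical values are hypothetical and serve only as an illustration.

```python
def select_model(empirical_errors, penalties):
    """Return the model index minimizing gamma_n(s_hat_m) + pen(m)."""
    return min(empirical_errors,
               key=lambda m: empirical_errors[m] + penalties[m])

# Hypothetical values: the empirical error decreases with model size,
# while the penalty increases with it.
errors = {1: 0.30, 2: 0.18, 3: 0.15, 4: 0.14}
pens = {1: 0.02, 2: 0.05, 3: 0.10, 4: 0.16}
m_hat = select_model(errors, pens)
```

In this toy example the largest model has the smallest empirical error, yet the penalized criterion selects model 2, where the sum of the two terms is smallest.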

We then have to determine some penalty function such that the risk of the approximate minimum penalized contrast estimator $\tilde s$ is of the same order as

$$\mathbb{E}[l(s, \hat s_{\bar m})] = \inf_{m \in \mathcal{M}}\{l(s, s_m) + \mathbb{E}[l(s_m, \hat s_m)]\}$$

or, failing that, at most of the same order as $\inf_{m \in \mathcal{M}}\{l(s, s_m) + \sqrt{V_m/n}\}$ when for each $m$ in $\mathcal{M}$, $S_m = \{\mathbb{1}_C, C \in \mathcal{C}_m\}$, $\mathcal{C}_m$ being a VC-class with VC-dimension $V_m$. Indeed, as cited in the introduction, Vapnik and Chervonenkis [26] proved that the global minimax risk over such a class $S_m$, defined by $\inf_{\hat s}\sup_{P,\, s \in S_m}\mathbb{E}[l(s, \hat s)]$, is of order $\sqrt{V_m/n}$ as soon as $n \ge \kappa V_m$, for some absolute constant $\kappa$.

The various strategies to determine adequate penalty functions rely on the same basic inequality that we present below. Let us fix $m$ in $\mathcal{M}$ and introduce the centered empirical contrast defined for all $t$ in $\mathcal{S}$ by $\bar\gamma_n(t) = \gamma_n(t) - \mathbb{E}[\mathbb{1}_{t(X) \ne Y}]$. Since

$$l(s_m, \tilde s) = \gamma_n(\tilde s) - \gamma_n(s_m) + \bar\gamma_n(s_m) - \bar\gamma_n(\tilde s),$$

by definition of $\hat m$ and $\hat s_m$, it is easy to see that

$$l(s, \tilde s) \le l(s, s_m) + \bar\gamma_n(s_m) + \mathrm{pen}(m) - \bar\gamma_n(\tilde s) - \mathrm{pen}(\hat m) + \rho_n \qquad (2)$$

holds whatever the penalty function. Looking at the problem from a global minimax point of view, since $\mathbb{E}[\bar\gamma_n(s_m)] = 0$, it is then a matter of choosing a penalty such that $\mathrm{pen}(\hat m)$ compensates for $-\bar\gamma_n(\tilde s)$ and such that $\mathbb{E}[\mathrm{pen}(m)]$ is of order at most $\sqrt{V_m/n}$ in the VC-case. Hence, we need to control $-\bar\gamma_n(t)$ uniformly for $t$ in $S_m$ and $m$ in $\mathcal{M}$, or $\sup_{t \in S_m}(-\bar\gamma_n(t))$ uniformly for $m$ in $\mathcal{M}$, and concentration inequalities appear as the appropriate tools.

Since we deal with a bounded contrast, we can use the so-called McDiarmid's [22] inequality that we recall here.

Theorem 1 (McDiarmid). Let $X_1, \ldots, X_n$ be independent random variables taking values in a set $A$, and assume that $\phi : A^n \to \mathbb{R}$ satisfies:

$$\sup_{x_1, \ldots, x_n,\, x_i' \in A}\big|\phi(x_1, \ldots, x_n) - \phi(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n)\big| \le c_i,$$

for all $i \in \{1, \ldots, n\}$. Then for all $x > 0$, the two following inequalities hold:

$$\mathbb{P}\big[\phi(X_1, \ldots, X_n) \ge \mathbb{E}[\phi(X_1, \ldots, X_n)] + x\big] \le \exp\Big(-2x^2 \Big/ \sum_{i=1}^n c_i^2\Big),$$
$$\mathbb{P}\big[\phi(X_1, \ldots, X_n) \le \mathbb{E}[\phi(X_1, \ldots, X_n)] - x\big] \le \exp\Big(-2x^2 \Big/ \sum_{i=1}^n c_i^2\Big).$$
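A quick numerical reading of this bound may be useful: when every $c_i$ equals $1/n$, a deviation of $\sqrt{x/(2n)}$ is exceeded with probability at most $e^{-x}$. The sketch below evaluates the bound and inverts it; the function names are hypothetical helpers introduced only for this illustration.

```python
import math

def mcdiarmid_tail(x, c):
    """Upper bound exp(-2 x^2 / sum_i c_i^2) on P[phi >= E phi + x]."""
    return math.exp(-2.0 * x * x / sum(ci * ci for ci in c))

def mcdiarmid_deviation(level, c):
    """Deviation x at which the bound equals exp(-level)."""
    return math.sqrt(level * sum(ci * ci for ci in c) / 2.0)

n = 200
c = [1.0 / n] * n                     # bounded differences c_i = 1/n
x = mcdiarmid_deviation(3.0, c)       # equals sqrt(3 / (2 n))
bound = mcdiarmid_tail(x, c)          # equals exp(-3) up to rounding
```

The same inversion with $c_i = 5/n$ gives the deviation $5\sqrt{x/(2n)}$, which is the form of the deviation term appearing in the bootstrap penalties studied in the paper.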
We can thus see that for all $m$ in $\mathcal{M}$, $\sup_{t \in S_m}(-\bar\gamma_n(t))$ concentrates around its expectation. A well-chosen estimator of an upper bound for $\mathbb{E}[\sup_{t \in S_m}(-\bar\gamma_n(t))]$, with expectation of order $\sqrt{V_m/n}$ in the VC-case, may therefore be a good penalty.

In this paper, we focus on penalties based on weighted empirical processes. The ideas developed here were initiated by the works of Koltchinskii [12] and Bartlett, Boucheron and Lugosi [2].

Let $\xi_1^n$ denote the sample $(X_1, Y_1), \ldots, (X_n, Y_n)$. Starting from the symmetrization tools used in empirical process theory, Koltchinskii [12] and Bartlett, Boucheron and Lugosi [2] propose a penalty based on the random variable $\hat R_m = 2\mathbb{E}\big[\sup_{t \in S_m} n^{-1}\sum_{i=1}^n \varepsilon_i\mathbb{1}_{t(X_i) \ne Y_i} \,\big|\, \xi_1^n\big]$, where $\varepsilon_1, \ldots, \varepsilon_n$ is a sequence of independent identically distributed Rademacher variables such that $\mathbb{P}[\varepsilon_i = 1] = \mathbb{P}[\varepsilon_i = -1] = 1/2$ and the $\varepsilon_i$'s are independent of $\xi_1^n$. More precisely, they take $\mathcal{M} = \mathbb{N}^*$ and they consider the minimum penalized contrast estimator $\tilde s$ given by (1) with $\mathrm{pen}(m) = \hat R_m + c_1\sqrt{\log m/n}$, for some absolute, positive constant $c_1$. Setting $L_t = \mathbb{P}[t(X) \ne Y]$, they prove that there exists some constant $c_2 > 0$ such that

$$\mathbb{E}[L_{\tilde s}] \le \inf_{m \in \mathcal{M}}\Big\{\inf_{t \in S_m} L_t + \mathbb{E}[\mathrm{pen}(m)]\Big\} + \frac{c_2}{\sqrt{n}} + \rho_n,$$

which can be translated in terms of risk bounds as follows:

$$\mathbb{E}[l(s, \tilde s)] \le \inf_{m \in \mathcal{M}}\{l(s, s_m) + \mathbb{E}[\mathrm{pen}(m)]\} + \frac{c_2}{\sqrt{n}} + \rho_n.$$

Moreover, it is well known (see [19] for instance) that if the collection $\{S_m, m \in \mathcal{M}\}$ of models is taken such that each $\mathcal{C}_m = \{\{x \in \mathcal{X}, t(x) = 1\}, t \in S_m\}$ is a VC-class of subsets of $\mathcal{X}$ with VC-dimension $V_m$, then $\mathbb{E}[\hat R_m]$ is of order $\sqrt{V_m/n}$.
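For a finite class, the conditional expectation defining the Rademacher quantity can be approximated by averaging over simulated sign vectors. The sketch below is such a Monte Carlo approximation and makes no claim to match the computational procedure of the cited works; the function name, the number of draws, the seed and the toy data are all assumptions made for illustration.

```python
import random

def rademacher_penalty(classifiers, sample, draws=500, seed=0):
    """Monte Carlo estimate of 2 E[ sup_t n^{-1} sum_i eps_i 1{t(X_i) != Y_i} | sample ]."""
    rng = random.Random(seed)
    n = len(sample)
    # Loss vectors 1{t(X_i) != Y_i}, one row per classifier, computed once.
    losses = [[1 if t(x) != y else 0 for x, y in sample] for t in classifiers]
    total = 0.0
    for _ in range(draws):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(e * l for e, l in zip(eps, row)) for row in losses) / n
    return 2.0 * total / draws

# Hypothetical data and two nested classes of rules.
sample = [(0.1, 0), (0.3, 0), (0.4, 1), (0.6, 0), (0.8, 1), (0.9, 1)]
small = [lambda x: 1 if x > 0.5 else 0]
large = small + [lambda x: 1 if x > 0.2 else 0, lambda x: 0, lambda x: 1]
pen_small = rademacher_penalty(small, sample)
pen_large = rademacher_penalty(large, sample)
```

Because both calls use the same seed and hence the same sign vectors, the estimate for the larger class is never smaller than the one for the smaller class, which reflects the role of this quantity as a data-dependent complexity measure.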

Our purpose is to extend this study by investigating penalty functions based on random variables of the form $\mathbb{E}\big[\sup_{t \in S_m} n^{-1}\sum_{i=1}^n Z_i\mathbb{1}_{t(X_i) \ne Y_i} \,\big|\, \xi_1^n\big]$, with various random weights $Z_1, \ldots, Z_n$.

To avoid dealing with measurability issues, we assume that all the classes of 
functions considered in the paper are at most countable. 

3 Efron’s Bootstrap Penalization 

Setting $\xi_i = (X_i, Y_i)$ for all $i$ in $\{1, \ldots, n\}$, let $P_n$ be the empirical process associated with the sample $\xi_1^n = (\xi_1, \ldots, \xi_n)$ and defined by $P_n(f) = n^{-1}\sum_{i=1}^n f(\xi_i)$. Let $P(f) = \mathbb{E}[f(X, Y)]$. For every $m$ in $\mathcal{M}$, denote by $\mathcal{F}_m$ the class of functions $\{f : \Xi \to \{0, 1\},\ f(x, y) = \mathbb{1}_{t(x) \ne y},\ t \in S_m\}$. As explained above with (2), we determine an adequate penalty function by controlling $\sup_{t \in S_m}(-\bar\gamma_n(t)) = \sup_{f \in \mathcal{F}_m}(P - P_n)(f)$ uniformly for $m$ in $\mathcal{M}$. Since McDiarmid's inequality allows us to prove that each supremum concentrates around its expectation, we only need to estimate $\mathbb{E}[\sup_{f \in \mathcal{F}_m}(P - P_n)(f)]$. Introduce now the Efron's bootstrap sample $\xi_{n,1}^*, \ldots, \xi_{n,n}^*$ given by

$$\xi_{n,i}^* = \sum_{j=1}^n \xi_j\,\mathbb{1}_{U_i \in\, ](j-1)/n,\, j/n]},$$

where $(U_1, \ldots, U_n)$ is a sample of $n$ i.i.d. random variables uniformly distributed on $]0, 1[$ independent of $\xi_1^n$. Denote by $P_n^b$ the corresponding empirical process.

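According to the asymptotic results due to Giné and Zinn [10], we can expect that $\mathbb{E}[\sup_{f \in \mathcal{F}_m}(P - P_n)(f)]$ is well approximated by $\mathbb{E}[\sup_{f \in \mathcal{F}_m}(P_n - P_n^b)(f) \mid \xi_1^n]$. In fact, starting from the observation that $\mathbb{E}[\sup_{f \in \mathcal{F}_m}(P_n - P_n^b)(f) \mid \xi_1^n]$ can be written as $\mathbb{E}[\sup_{f \in \mathcal{F}_m} n^{-1}\sum_{i=1}^n (1 - M_{n,i})f(\xi_i) \mid \xi_1^n]$, where $(M_{n,1}, \ldots, M_{n,n})$ is a multinomial vector with parameters $(n, n^{-1}, \ldots, n^{-1})$, using McDiarmid's inequality again, we can obtain an exponential bound for

$$\mathbb{E}\Big[\sup_{f \in \mathcal{F}_m}(P - P_n)(f)\Big] - 2e\,\mathbb{E}\Big[\sup_{f \in \mathcal{F}_m}(P_n - P_n^b)(f) \,\Big|\, \xi_1^n\Big],$$

and a fortiori for $\sup_{f \in \mathcal{F}_m}(P - P_n)(f) - 2e\,\mathbb{E}[\sup_{f \in \mathcal{F}_m}(P_n - P_n^b)(f) \mid \xi_1^n]$.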

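Proposition 1. Let $\mathcal{F}$ be some countable set of measurable functions from $\Xi$ to $[0, 1]$. For any $x > 0$, the following inequality holds:

$$\mathbb{P}\left[\sup_{f \in \mathcal{F}}(P - P_n)(f) - 2e\,\mathbb{E}\Big[\sup_{f \in \mathcal{F}}(P_n - P_n^b)(f) \,\Big|\, \xi_1^n\Big] \ge 5\sqrt{\frac{x}{2n}}\right] \le e^{-x}.$$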

Proof. Let $M_n = (M_{n,1}, \ldots, M_{n,n})$ with $M_{n,i} = \sum_{j=1}^n \mathbb{1}_{U_j \in\, ](i-1)/n,\, i/n]}$ for all $i \in \{1, \ldots, n\}$. $M_n$ is a multinomial vector with parameters $(n, n^{-1}, \ldots, n^{-1})$ independent of $\xi_1^n$ and the bootstrap empirical process $P_n^b$ can be written as: $P_n^b(f) = n^{-1}\sum_{i=1}^n M_{n,i}f(\xi_i)$. By Jensen's inequality, we get:

r 1 1 r 1 " r l' 

E sup(P-P„)(/) <— --E snp-J2^ (B(/) - Cr 

IfeT J F [M„,i = 2[ ^ L J 

1 " 

< 2eE sup - V(P(/) - /(G))IIm„ ,=2 
1 " 

<2eE sup -J2(Pif) - 
^ 7^1 

r 1 ^ 

<2eJE E sup-y(P(/)-/(^0)(Af„,i-l)IlM„,=2 
_ [fer n ^ 

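It is well known that if $U$ and $V$ are random variables such that for all $g$ in a class of functions $\mathcal{G}$, $g(U)$ and $g(V)$ are independent and $\mathbb{E}[g(V)] = 0$, then

$$\mathbb{E}\Big[\sup_{g \in \mathcal{G}} g(U)\Big] \le \mathbb{E}\Big[\sup_{g \in \mathcal{G}}(g(U) + g(V))\Big]. \qquad (3)$$

Since $M_n$ is independent of $\xi_1^n$, for all $f$ in $\mathcal{F}$, conditionally given $M_n$,

$$\sum_{i=1}^n (P(f) - f(\xi_i))(M_{n,i} - 1)\mathbb{1}_{M_{n,i} = 2} \quad \text{and} \quad \sum_{i=1}^n (P(f) - f(\xi_i))(M_{n,i} - 1)\mathbb{1}_{M_{n,i} \ne 2}$$

are centered and independent. So, applying (3) conditionally given $M_n$, one gets:

$$\mathbb{E}\Big[\sup_{f \in \mathcal{F}}(P - P_n)(f)\Big] \le 2e\,\mathbb{E}\Big[\sup_{f \in \mathcal{F}} n^{-1}\sum_{i=1}^n (P(f) - f(\xi_i))(M_{n,i} - 1)\Big],$$

that is

$$\mathbb{E}\Big[\sup_{f \in \mathcal{F}}(P - P_n)(f)\Big] \le 2e\,\mathbb{E}\Big[\sup_{f \in \mathcal{F}}(P_n - P_n^b)(f)\Big]. \qquad (4)$$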

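One can see by straightforward computations that the variable $\phi(\xi_1, \ldots, \xi_n) = \sup_{f \in \mathcal{F}}(P - P_n)(f) - 2e\,\mathbb{E}[\sup_{f \in \mathcal{F}}(P_n - P_n^b)(f) \mid \xi_1^n]$ satisfies the assumptions of McDiarmid's inequality with $c_i = 5/n$ for all $i \in \{1, \ldots, n\}$. We thus have:

$$\mathbb{P}\left[\sup_{f \in \mathcal{F}}(P - P_n)(f) - 2e\,\mathbb{E}\Big[\sup_{f \in \mathcal{F}}(P_n - P_n^b)(f) \,\Big|\, \xi_1^n\Big] \ge \mathbb{E}\Big[\sup_{f \in \mathcal{F}}(P - P_n)(f) - 2e\sup_{f \in \mathcal{F}}(P_n - P_n^b)(f)\Big] + 5\sqrt{\frac{x}{2n}}\right] \le e^{-x},$$

and Proposition 1 follows from (4).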

From this bound, we can derive non-asymptotic properties for the minimum 
penalized contrast estimator obtained via an Efron’s bootstrap based penalty. 

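Theorem 2. Let $\xi_1^n = (X_1, Y_1), \ldots, (X_n, Y_n)$ be a sample of $n$ independent copies of a couple of variables $(X, Y)$ with values in $\mathcal{X} \times \{0, 1\}$ and with joint distribution $P$. Let $\xi_n^* = (X_{n,1}^*, Y_{n,1}^*), \ldots, (X_{n,n}^*, Y_{n,n}^*)$ be the Efron's bootstrap sample defined for $i$ in $\{1, \ldots, n\}$ by:

$$(X_{n,i}^*, Y_{n,i}^*) = \sum_{j=1}^n (X_j, Y_j)\,\mathbb{1}_{U_i \in\, ](j-1)/n,\, j/n]},$$

where $(U_1, \ldots, U_n)$ is a sample of $n$ i.i.d. random variables uniformly distributed on $]0, 1[$ independent of $\xi_1^n$. Let

$$\gamma_n(t) = n^{-1}\sum_{i=1}^n \mathbb{1}_{t(X_i) \ne Y_i} \quad \text{and} \quad \gamma_n^b(t) = n^{-1}\sum_{i=1}^n \mathbb{1}_{t(X_{n,i}^*) \ne Y_{n,i}^*}.$$

Consider a countable collection $\{S_m, m \in \mathcal{M}\}$ of classes of functions in $\mathcal{S}$ and a family $(x_m)_{m \in \mathcal{M}}$ of nonnegative weights such that for some absolute constant $\Sigma$, $\sum_{m \in \mathcal{M}} e^{-x_m} \le \Sigma$. Introduce the loss function $l(s, t) = \mathbb{E}[\mathbb{1}_{t(X) \ne Y} - \mathbb{1}_{s(X) \ne Y}]$ and assume that for each $m$ in $\mathcal{M}$, there exists a minimizer $s_m$ of $l(s, \cdot)$ over $S_m$. Choose the penalty function such that

$$\mathrm{pen}(m) = 2e\,\mathbb{E}\Big[\sup_{t \in S_m}(\gamma_n(t) - \gamma_n^b(t)) \,\Big|\, \xi_1^n\Big] + 5\sqrt{\frac{x_m}{2n}}.$$

The approximate minimum penalized contrast estimator $\tilde s$ given by (1) satisfies:

$$\mathbb{E}[l(s, \tilde s)] \le \inf_{m \in \mathcal{M}}\{l(s, s_m) + \mathbb{E}[\mathrm{pen}(m)]\} + \frac{5\Sigma}{2}\sqrt{\frac{\pi}{2n}} + \rho_n.$$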

Moreover, if for all $m$ in $\mathcal{M}$, $S_m = \{\mathbb{1}_C, C \in \mathcal{C}_m\}$, where $\mathcal{C}_m$ is a VC-class with VC-dimension $V_m \ge 1$, and assuming that $n \ge 4$, there exists some positive, absolute constant $\kappa$ such that

IE[Z(s,s)]< inf f{s, Sm) + K ( -\-—log^ n-\- + p^, 
mGM \\n n \ n I \ 
Comments:

(i) The risk bounds obtained here are similar to the ones proved by Koltchinskii and by Bartlett, Boucheron and Lugosi in the Rademacher penalization context. In particular, we have the following minimax result.

Consider a collection $\{S_m, m \in \mathcal{M}\}$ of at most $n$ classes of functions from $\mathcal{X}$ to $\{0, 1\}$ such that for each $m$ in $\mathcal{M}$, $S_m = \{\mathbb{1}_C, C \in \mathcal{C}_m\}$, $\mathcal{C}_m$ being a VC-class with VC-dimension $V_m \ge 1$. If the Bayes classifier $s$ associated with $(X, Y)$ is in some $S_{m_0}$, the approximate minimum penalized contrast estimator $\tilde s$ obtained from the above Efron's bootstrap penalization satisfies:

IE [l{s, s)] < k' + ]J-^ + + Pn- 

This implies that when logn < Vmg Si n/log^n holds and when p„ is at most 
s achieves, up to a constant, the global minimax risk over Smg- 
(ii) The constant $2e$ in the expression of the penalty term is due to technical reasons, but all the experiments that we have carried out show that it is too pessimistic. These experiments indeed lead us to think that the real constant is about 1 and to take in practice a penalty equal to $\mathbb{E}[\sup_{t \in S_m}(\gamma_n(t) - \gamma_n^b(t)) \mid \xi_1^n]$.
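For a finite class, the conditional expectation in the bootstrap penalty can be approximated by resampling, using the identity $\gamma_n(t) - \gamma_n^b(t) = n^{-1}\sum_i (1 - M_{n,i})\mathbb{1}_{t(X_i) \ne Y_i}$. The following is a minimal Monte Carlo sketch under that assumption, not the author's implementation; the function name, the `constant` argument (1 for the practical choice, $2e$ for the theoretical one), the number of draws, the seed and the toy data are hypothetical, and the deviation term $5\sqrt{x_m/(2n)}$ is left out.

```python
import random

def efron_bootstrap_penalty(classifiers, sample, constant=1.0, draws=500, seed=0):
    """Monte Carlo estimate of constant * E[ sup_t (gamma_n(t) - gamma_n^b(t)) | sample ]."""
    rng = random.Random(seed)
    n = len(sample)
    losses = [[1 if t(x) != y else 0 for x, y in sample] for t in classifiers]
    total = 0.0
    for _draw in range(draws):
        # Multinomial counts M_{n,i}: how often point i is drawn in the resample.
        counts = [0] * n
        for _pick in range(n):
            counts[rng.randrange(n)] += 1
        total += max(sum((1 - m) * l for m, l in zip(counts, row))
                     for row in losses) / n
    return constant * total / draws

# Hypothetical data and a small class of threshold rules.
sample = [(0.1, 0), (0.3, 0), (0.4, 1), (0.6, 0), (0.8, 1), (0.9, 1)]
rules = [lambda x: 1 if x > 0.5 else 0, lambda x: 1 if x > 0.2 else 0]
pen_practical = efron_bootstrap_penalty(rules, sample, constant=1.0)
pen_doubled = efron_bootstrap_penalty(rules, sample, constant=2.0)
```

A rule that errs on every point, or on none, contributes exactly zero to the supremum, since the multinomial counts sum to $n$; only rules with a mixed error pattern make the penalty positive.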

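Proof. Let us prove the first part of Theorem 2. Recall that for any $m$ in $\mathcal{M}$, $\tilde s$ satisfies the inequality (2):

$$l(s, \tilde s) \le l(s, s_m) + \bar\gamma_n(s_m) + \mathrm{pen}(m) - \bar\gamma_n(\tilde s) - \mathrm{pen}(\hat m) + \rho_n,$$

with $\bar\gamma_n(t) = \gamma_n(t) - \mathbb{E}[\mathbb{1}_{t(X) \ne Y}]$. Let $\hat B_m = 2e\,\mathbb{E}[\sup_{t \in S_m}(\gamma_n(t) - \gamma_n^b(t)) \mid \xi_1^n]$. Introduce a family $(x_m)_{m \in \mathcal{M}}$ of nonnegative weights such that for some absolute constant $\Sigma$, $\sum_{m \in \mathcal{M}} e^{-x_m} \le \Sigma$. Applying Proposition 1 with $\mathcal{F} = \{(x, y) \mapsto \mathbb{1}_{t(x) \ne y},\ t \in S_{m'}\}$ and $x = x_{m'} + \zeta$ for every $m'$ in $\mathcal{M}$, we obtain that for all $\zeta > 0$, except on a set of probability not larger than $\Sigma e^{-\zeta}$,

$$\sup_{t \in S_{m'}}(-\bar\gamma_n(t)) \le \hat B_{m'} + 5\sqrt{\frac{x_{m'} + \zeta}{2n}}, \quad \forall m' \in \mathcal{M}.$$

This implies that, except on a set of probability not larger than $\Sigma e^{-\zeta}$,

$$l(s, \tilde s) \le l(s, s_m) + \bar\gamma_n(s_m) + \mathrm{pen}(m) + \hat B_{\hat m} + 5\sqrt{\frac{x_{\hat m}}{2n}} - \mathrm{pen}(\hat m) + 5\sqrt{\frac{\zeta}{2n}} + \rho_n$$

holds. Therefore, if $\mathrm{pen}(m) \ge \hat B_m + 5\sqrt{x_m/(2n)}$ for every $m$ in $\mathcal{M}$,

$$\mathbb{P}\left[l(s, \tilde s) > l(s, s_m) + \bar\gamma_n(s_m) + \mathrm{pen}(m) + \rho_n + 5\sqrt{\frac{\zeta}{2n}}\right] \le \Sigma e^{-\zeta},$$

which leads by integration with respect to $\zeta$ to:

$$\mathbb{E}\Big[\big(l(s, \tilde s) - l(s, s_m) - \bar\gamma_n(s_m) - \mathrm{pen}(m) - \rho_n\big)_+\Big] \le \frac{5\Sigma}{2}\sqrt{\frac{\pi}{2n}}.$$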
Since $\mathbb{E}[\bar\gamma_n(s_m)] = 0$, we obtain that

$$\mathbb{E}[l(s, \tilde s)] \le l(s, s_m) + \mathbb{E}[\mathrm{pen}(m)] + \frac{5\Sigma}{2}\sqrt{\frac{\pi}{2n}} + \rho_n,$$

which gives, since $m$ can be taken arbitrarily in $\mathcal{M}$, the expected risk bound.

Let us now look for an upper bound for $\mathbb{E}[\hat B_m]$ when for all $m$ in $\mathcal{M}$, $S_m = \{\mathbb{1}_C, C \in \mathcal{C}_m\}$, $\mathcal{C}_m$ being a VC-class with VC-dimension $V_m \ge 1$. In view of Theorem 4, the main difficulty lies in the fact that the variables $(1 - M_{n,i})$ are not independent. To remove the dependence, we use the classical tool of Poissonization. Let $N$ be a Poisson random variable with parameter $n$ independent of $\xi_1^n$ and of the $U_j$'s, and for all $i \in \{1, \ldots, n\}$, $N_i = \sum_{j=1}^N \mathbb{1}_{U_j \in\, ](i-1)/n,\, i/n]}$. The $N_i$'s are independent identically distributed Poisson random variables with parameter 1 and we see that

$$\mathbb{E}[\hat B_m] \le \frac{2e}{n}\mathbb{E}\Big[\sup_{t \in S_m}\sum_{i=1}^n (1 - N_i)\mathbb{1}_{t(X_i) \ne Y_i}\Big] + \frac{2e}{n}\mathbb{E}\Big[\sum_{i=1}^n |N_i - M_{n,i}|\Big].$$

Since $\sum_{i=1}^n |N_i - M_{n,i}| = |N - n|$, we get:

$$\mathbb{E}[\hat B_m] \le \frac{2e}{n}\mathbb{E}\Big[\sup_{t \in S_m}\sum_{i=1}^n (1 - N_i)\mathbb{1}_{t(X_i) \ne Y_i}\Big] + \frac{2e}{\sqrt{n}}. \qquad (5)$$

Furthermore, the $(1 - N_i)$'s are i.i.d. centered real random variables satisfying the moments condition (7) with $v = 1$ and $c = 1$, and Theorem 4 allows us to conclude.

4 Symmetrized Bootstrap Penalization 

Noting that the bootstrap empirical process satisfies $P_n^b(f) = n^{-1}\sum_{i=1}^n M_{n,i}f(\xi_i)$, where $(M_{n,1}, \ldots, M_{n,n})$ is a multinomial vector with parameters $(n, n^{-1}, \ldots, n^{-1})$, Efron [8] suggests considering other ways to bootstrap. Let $W_n = (W_{n,1}, \ldots, W_{n,n})$ denote a vector of $n$ exchangeable and nonnegative random variables independent of the $\xi_i$'s and satisfying $\sum_{i=1}^n W_{n,i} = n$. Then $P_n^w(f) = n^{-1}\sum_{i=1}^n W_{n,i}f(\xi_i)$ defines a weighted bootstrap empirical process. Praestgaard and Wellner [23] obtain, for such processes, some results that extend the ones due to Giné and Zinn [10]. The best known and most often used example is the i.i.d. weighted bootstrap which is defined by $W_{n,i} = V_i/\bar V_n$, where $V_1, \ldots, V_n$ are i.i.d. positive random variables and $\bar V_n = n^{-1}\sum_{i=1}^n V_i$. This is the case in which we are interested in this section.
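The normalization by $\bar V_n$ is what makes the weights sum to $n$, as required of a weighted bootstrap. The short sketch below generates such weights and evaluates the weighted empirical mean of a vector of function values; the choice of standard exponential variables for the $V_i$, the function names and the sample values are assumptions made only for this example.

```python
import random

def iid_bootstrap_weights(n, seed=0):
    """Weights W_{n,i} = V_i / mean(V) with V_i i.i.d. positive (here Exp(1), an assumption)."""
    rng = random.Random(seed)
    v = [rng.expovariate(1.0) for _ in range(n)]
    v_bar = sum(v) / n
    return [vi / v_bar for vi in v]

def weighted_empirical_mean(weights, values):
    """P_n^w(f) = n^{-1} sum_i W_{n,i} f(xi_i), given the values f(xi_i)."""
    return sum(w * f for w, f in zip(weights, values)) / len(values)

n = 8
weights = iid_bootstrap_weights(n)
# Hypothetical 0/1 losses f(xi_i) of one classifier on the sample.
values = [0, 1, 0, 0, 1, 1, 0, 1]
p_n_w = weighted_empirical_mean(weights, values)
```

Since the weights are nonnegative and sum to $n$, the weighted mean of a constant function equals that constant, just as for the ordinary empirical process.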

With the same notations as in the previous section, from Praestgaard and Wellner's results, we could expect that $\mathbb{E}[\sup_{f \in \mathcal{F}_m}(P - P_n)(f)]$ is sufficiently well approximated by $\sqrt{\mathbb{E}[V_1]^2/\mathrm{Var}[V_1]}\;\mathbb{E}[\sup_{f \in \mathcal{F}_m}(P_n - P_n^w)(f) \mid \xi_1^n]$, but we could not prove it in a general way. However, considering here the symmetrized bootstrap process $(P_n^w - P_n^{w'})$, where $P_n^{w'}$ is the i.i.d. weighted bootstrap process associated with an independent copy $(V_1', \ldots, V_n')$ of $(V_1, \ldots, V_n)$, allows us to use some symmetrization tools that generalize those cited in [12] and [2] and lead to the following result.
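Proposition 2. Consider some countable set $\mathcal{F}$ of measurable functions from $\Xi$ to $[0, 1]$. Let $Z_1, \ldots, Z_n$ be a sequence of i.i.d. symmetric variables independent of $\xi_1^n$ and such that $\mathbb{E}[|Z_1|] < +\infty$. For any $x > 0$,

$$\mathbb{P}\left[\sup_{f \in \mathcal{F}}(P - P_n)(f) - \frac{2}{\mathbb{E}[|Z_1|]}\mathbb{E}\Big[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n Z_i f(\xi_i) \,\Big|\, \xi_1^n\Big] \ge 3\sqrt{\frac{x}{2n}}\right] \le e^{-x}.$$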

We can then get an exponential bound for 

sup (p-p„)(/) - sup cr , 

provided that the $V_i$'s satisfy some moments conditions specified below. The same arguments lead furthermore to other kinds of penalties involving symmetric variables or symmetrized Efron's bootstrap processes.

Theorem 3 provides an upper bound for the risk of the approximate minimum 
penalized contrast estimators obtained via such penalties. 

Theorem 3. Assume that $n \ge 4$ and let $\xi_1^n = (X_1, Y_1), \ldots, (X_n, Y_n)$ be a sample of $n$ independent copies of a couple of variables $(X, Y)$ with values in $\mathcal{X} \times \{0, 1\}$ and with joint distribution $P$. Let $(W_{n,1}, \ldots, W_{n,n})$, $(W_{n,1}', \ldots, W_{n,n}')$ and $\eta$ be defined by one of the three following propositions:

1. For all $i \in \{1, \ldots, n\}$, $W_{n,i} = V_i$, $W_{n,i}' = V_i'$ and $\eta = 1/\mathbb{E}[|V_1 - V_1'|]$, where $V_1^n = (V_1, \ldots, V_n)$ is a sample of $n$ i.i.d. nonnegative random variables independent of $\xi_1^n$ and satisfying

$$\forall k \ge 2, \quad \mathbb{E}\big[|V_1|^k\big] \le \frac{k!}{2}\,v\,c^{k-2} \qquad (6)$$

for $v > 0$ and $c > 0$, $(V_1', \ldots, V_n')$ is a copy of $V_1^n$ independent of $V_1^n$ and $\xi_1^n$.

2. $\eta = 1$ and for all $i \in \{1, \ldots, n\}$, $W_{n,i} = M_{n,i}$, $W_{n,i}' = M_{n,i}'$, where $M_n = (M_{n,1}, \ldots, M_{n,n})$ is a multinomial vector with parameters $(n, n^{-1}, \ldots, n^{-1})$ independent of $\xi_1^n$ and $(M_{n,1}', \ldots, M_{n,n}')$ is a copy of $M_n$ independent of $M_n$ and $\xi_1^n$.

3. For all $i \in \{1, \ldots, n\}$, $W_{n,i} = V_i/\bar V_n$, $W_{n,i}' = V_i'/\bar V_n'$ and $\eta = \mathbb{E}[V_1]/\mathbb{E}[|V_1 - V_1'|]$, where $V_1^n = (V_1, \ldots, V_n)$ is a sample of $n$ i.i.d. positive random variables independent of $\xi_1^n$ satisfying (6), $(V_1', \ldots, V_n')$ is a copy of $V_1^n$ independent of $V_1^n$ and $\xi_1^n$.

Consider a countable collection $\{S_m, m \in \mathcal{M}\}$ of classes of functions in $\mathcal{S}$ and a family $(x_m)_{m \in \mathcal{M}}$ of nonnegative weights such that $\sum_{m \in \mathcal{M}} e^{-x_m} \le \Sigma$, for some absolute constant $\Sigma$. Introduce the loss function $l(s, t) = \mathbb{E}[\mathbb{1}_{t(X) \ne Y} - \mathbb{1}_{s(X) \ne Y}]$ and assume that for each $m$ in $\mathcal{M}$, there exists a minimizer $s_m$ of $l(s, \cdot)$ over $S_m$. Choose a penalty function such that

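$$\mathrm{pen}(m) = 2\eta\,\mathbb{E}\Big[\sup_{t \in S_m}\frac{1}{n}\sum_{i=1}^n (W_{n,i} - W_{n,i}')\mathbb{1}_{t(X_i) \ne Y_i} \,\Big|\, \xi_1^n\Big] + 3\sqrt{\frac{x_m}{2n}}.$$

The approximate minimum penalized contrast estimator $\tilde s$ given by (1) satisfies:

$$\mathbb{E}[l(s, \tilde s)] \le \inf_{m \in \mathcal{M}}\{l(s, s_m) + \mathbb{E}[\mathrm{pen}(m)]\} + \frac{3\Sigma}{2}\sqrt{\frac{\pi}{2n}} + \frac{\nu}{\sqrt{n}} + \rho_n,$$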
where $\nu$ is some constant which may depend on $v$, $c$, $\mathbb{E}[V_1]$ and $\mathbb{E}[|V_1 - V_1'|]$. Moreover, if for all $m$, $S_m = \{\mathbb{1}_C, C \in \mathcal{C}_m\}$, where $\mathcal{C}_m$ is a VC-class with VC-dimension $V_m \ge 1$, there exists some positive constant $\nu'(v, c, \mathbb{E}[V_1], \mathbb{E}[|V_1 - V_1'|])$ such that

IE[;(s, s)]< inf ll{s,Sm) + i^'(\f^+—'log'^n-\-i[^'\ 
meM I \ \ n n \ n I 




and if {Wn,i, • ■ • , Wn,n), i 5 ■ • • 5 Wf „) are defined as in the cases 1 or 3 with 



(Vi — V{) satisfying IE 



„A(Vi-Vi') 



< for any A > 0, 



IE[/(s,s)]< inf llis,Sm) + ( \f^ + \[^'\ 

meM I \ V n \ n I 

Comments: 

(i) The structure of the risk upper bound derived here is essentially the same as 
the bound achieved by the approximate minimum penalized contrast estimator 
considered in Theorem 2, so one can see in the same way that it is optimal in a 
global minimax sense over sets of functions based on VC-classes. 

(ii) As in Theorem 2, we shall also remark that the factor 2 in the penalty term, 
which comes from symmetrization inequalities, is pessimistic. A practical study 
actually shows that the real factor is closer to 1. 

(iii) The subgaussian inequality $\mathbb{E}[e^{\lambda Z_1}] \le e^{\lambda^2/2}$ for all $\lambda > 0$ is essentially satisfied by the Gaussian and Rademacher variables. We can then deduce from Theorem 3 the result of Koltchinskii [12] and Bartlett, Boucheron and Lugosi [2] about Rademacher penalization.

Proof. The key point of the proof is the computation of an exponential inequality for

$$\sup_{t \in S_m}(-\bar\gamma_n(t)) - 2\eta\,\mathbb{E}\Big[\sup_{t \in S_m} n^{-1}\sum_{i=1}^n (W_{n,i} - W_{n,i}')\mathbb{1}_{t(X_i) \ne Y_i} \,\Big|\, \xi_1^n\Big]$$

in the three considered cases.

For the first case, a direct application of Proposition 2 provides such an inequality. For the second case, as in (5) we can use Poissonization to remove the dependence between the $(M_{n,i} - M_{n,i}')$'s and to apply Proposition 2. The fact that $\mathbb{E}[|N_1 - N_1'|] \ge 1$ for any independent Poisson variables $N_1$ and $N_1'$ with parameter 1 finally leads to an appropriate inequality. For the third case, we still have to remove the dependence between the $(W_{n,i} - W_{n,i}')$'s. To do this, we notice that if $\hat W_m = 2\eta\,\mathbb{E}\big[\sup_{t \in S_m} n^{-1}\sum_{i=1}^n (W_{n,i} - W_{n,i}')\mathbb{1}_{t(X_i) \ne Y_i} \,\big|\, \xi_1^n\big]$:

-E[Vi]| 
'^] ■ 

Moreover, successive applications of the special version of Bernstein’s inequality 
proposed by Birge and Massart [4] lead to an exponential bound which gives by 
integration (see [9] for further details): 

IE [(Pi/K) -IE[Pi]|] < C(u,c,IE[Pi])/Vn. 



nIE [|Fi-F/|] 



E 



sup^ 



1 1 




4E 




-Wm 


< 



Fl \v 

a I " 



E[|Fi 
We can then use Proposition 2. In all cases, we obtain for all $x > 0$, for all $m \in \mathcal{M}$,

$$\mathbb{P}\left[\sup_{t \in S_m}(-\bar\gamma_n(t)) - \frac{2\eta}{n}\mathbb{E}\Big[\sup_{t \in S_m}\sum_{i=1}^n (W_{n,i} - W_{n,i}')\mathbb{1}_{t(X_i) \ne Y_i} \,\Big|\, \xi_1^n\Big] \ge 3\sqrt{\frac{x}{2n}} + \frac{\nu}{\sqrt{n}}\right] \le e^{-x},$$

where $\nu$ is a constant which may depend on $v$, $c$, $\mathbb{E}[V_1]$ and $\mathbb{E}[|V_1 - V_1'|]$. We conclude in the same way as in the proof of Theorem 2. The risk bound in the VC-case follows from Theorem 4.



5 A Maximal Inequality 

Our purpose in this section is to provide a maximal inequality for weighted empirical processes. To do this, we first need the chaining result stated below. We set for all $a = (a_1, \ldots, a_n)$ in $\mathbb{R}^n$, $\|a\|_2^2 = \sum_{i=1}^n a_i^2$. For $\varepsilon > 0$ and $A \subset \mathbb{R}^n$, let $H_2(\varepsilon, A)$ denote the logarithm of the maximal number $N$ of elements $\{a^{(1)}, \ldots, a^{(N)}\}$ in $A$ such that for every $l \ne l'$, $\|a^{(l)} - a^{(l')}\|_2^2 > \varepsilon^2$.

Lemma 1. Let $A$ be some subset of $[0, 1]^n$ and $Z_1, \ldots, Z_n$ i.i.d. centered real random variables. Let $\delta > 0$ such that $\sup_{a \in A}\|a\|_2 \le \delta$ and assume that there exist some positive constants $v$ and $c$ such that the $Z_i$'s satisfy the condition:

$$\forall k \ge 2, \quad \mathbb{E}\big[|Z_1|^k\big] \le \frac{k!}{2}\,v\,c^{k-2}. \qquad (7)$$

Then, one has 

n “I +00 / \ 

E sup^Oi^i <3^ <5v/72“-’Yi72 (2-b+i)5,A) -f c(2"^5 A 1)7/2 A 1 

J y ) 

and if for all A > 0, E 

The proof of this lemma is inspired by Lemma 15 in [20]. It is based on Birge and 
Massart’s [4] version of Bernstein’s inequality and Lemma 2 in [20] which follows 
from an argument due to Pisier. We can then prove the following theorem. 

Theorem 4. Let $(X_1,Y_1), \ldots, (X_n,Y_n)$ be a sample of n independent
copies of a couple of variables $(X,Y)$ with values in $\mathcal{X} \times \{0,1\}$.
Introduce n i.i.d. real random variables $Z_1, \ldots, Z_n$, centered, independent of
the sample and satisfying the moments condition (7) for some positive constants
v and c. Let $S_m = \{\mathbb{1}_C, C \in \mathcal{C}_m\}$ where $\mathcal{C}_m$ is a VC-class with
VC-dimension $V_m$ and assume that n > 4. There exist some absolute constants
$\kappa_1$ and $\kappa_2$ such that:

1 " ^ p\f Y 

E - sup < Kiv^V — + «: 2 C— log^n, (8) 

n ’ V n n 
and if for all A > 0, IE then 

IE [n~^ sup^^sJ£f^^^Zilt(Xi)^Y,] < Kii/ym/n. 

Proof. Considering Bm = {{{x,y) G df x {0, 1}, Ic(a;) y}, C G Cm} and the 
set Am = {(S.b{Xi,Yi),. ■ ■ AB{Xn,Yn)), B G Bm}, One has 

® [supt6s„Er=i^*2t(x,)^yj = IE [sup„g_4^x;r=ia*^*] • 

Moreover, sup£,g_ 4 ^ |la ||2 < y/n, and by definition of Am, for e > 0, 

H 2 {y/ne,Am) = H{s,Bm,Pn), where H{s,Bm,Pn) is the e— metric entropy of 
Bm with respect to the empirical measure For any proba- 

bility measure Q, the e— metric entropy H{e,Bm,Q) of Bm with respect to 
Q is the logarithm of the maximal number N of elements {6^)^ . . . ^ &(^)} in 
{Eb, B g Bm} such that for all 1,1' G N}, I yf I', IEq(6(') - > £2. 

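Let us denote by $H(\varepsilon, \mathcal{B}_m)$ the universal $\varepsilon$-metric entropy of $\mathcal{B}_m$, that is
$H(\varepsilon, \mathcal{B}_m) = \sup_Q H(\varepsilon, \mathcal{B}_m, Q)$, where the supremum is taken over all the
probability measures on $\mathcal{X} \times \{0, 1\}$. For all j in $\mathbb{N}$,

$$H_2(2^{-(j+1)}\sqrt{n}, A_m) \leq H(2^{-(j+1)}, \mathcal{B}_m).$$

Furthermore, since $\mathcal{B}_m$ is a VC-class with VC-dimension not larger than $V_m$,
Haussler's [11] bound gives:

$$H(2^{-(j+1)}, \mathcal{B}_m) \leq \kappa V_m \left(1 + (j+1)\log 2\right) \quad \forall j \in \mathbb{N},$$

for some positive constant $\kappa$. Hence, from Lemma 1 we get: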

+ 00 



IE 



sup ^ ZAt(^Xi)^Yi 
t&S, — 



i=l 



< 3 ^ (^y/vK2 ^ yJnVm (1 + (j + 1) log 2) 

+ck{ 2~^ y/n ^ l)Vm (1 + (j + l)log2)^. 



1=0 



which leads by some direct computations to the upper bound (8). The upper 
bound in the subgaussian case is obtained in the same way. 



6 Conclusion 

In this conclusion, we wish to point out that the theoretical results presented
here do not allow us to come out in favour of one of the investigated penalization
schemes. In particular, as we consider the problem from the global minimax point
of view, we cannot decide between Rademacher and bootstrap type penalties.

Nevertheless, it is now admitted that the global minimax risk is not an ideal
benchmark to evaluate the relevance of classification rules, since it may
overestimate the risk in some situations. Vapnik and Chervonenkis' [26] results in
the so-called zero-error case first raised this question. Devroye and Lugosi [7]
then confirmed these reservations. They proved that for $S = \{\mathbb{1}_C, C \in \mathcal{C}\}$ where $\mathcal{C}$
is a VC-class with VC-dimension $V(\mathcal{C})$, setting $L_t = P[t(X) \neq Y]$ and fixing $L^*$
in $]0, 1/2[$, there exist some constants $\kappa_1$ and $\kappa_2$ such that for any classification
rule $\hat{s}$, if $\kappa_1 L^*(1 - 2L^*)^2 \geq V(\mathcal{C})/n$,

The localized versions of Rademacher penalization recently proposed by
Koltchinskii and Panchenko [13], Bartlett, Bousquet and Mendelson [3] and
Lugosi and Wegkamp [16] make it possible to construct classification rules
satisfying oracle type inequalities with the appropriate dependence on $L^*$. In the
same spirit, we could introduce some localized bootstrap penalties. This would
entail improving the inequality given in Proposition 1 under propitious conditions,
for example when the classification error is small. Boucheron, Lugosi and Massart's
[5] concentration inequality seems to be the adequate tool, though it cannot
be directly applied because of the dependence between the weights involved in
the bootstrap processes. Some refined Poissonization techniques may allow us to
overcome this difficulty.

However, by further analyzing the problem, the works of Mammen and Tsybakov [18],
Tsybakov [24] and Massart and Nedelec [21] highlight the fact that one
can describe the minimax risk more precisely for some pairs $(X, Y)$ satisfying
a prescribed margin condition. Massart and Nedelec [21] prove that if for every
$h \in [0, 1]$, $\mathcal{P}(h, S)$ denotes the set of the distributions $P$ such that $s^* \in S$ and
$(X, Y)$ satisfies the margin condition $|2\mathbb{E}[Y | X = x] - 1| \geq h$ for all $x$ in $\mathcal{X}$, if
$2 \leq V(\mathcal{C}) \leq n$,

$$\inf_{\hat{s}} \sup_{P \in \mathcal{P}(h,S)} \mathbb{E}[l(s^*, \hat{s})] \geq \kappa_3 \left( \frac{V(\mathcal{C})}{nh} \wedge \sqrt{\frac{V(\mathcal{C})}{n}} \right).$$

In view of these works, a desirable goal would be to develop some estimation pro- 
cedures which lead to classification rules adapting better to the margin. Localized 
versions of Rademacher or bootstrap penalization may provide such procedures. 
But these methods essentially have a theoretical interest. 

We are hopeful that the connection made here between Rademacher penaliza- 
tion and the bootstrap approach, which takes advantage of its intuitive qualities, 
provides new lines of research towards more operational methods of construction 
of “margin adaptive” classification rules. 



Acknowledgements. The author wishes to thank Stephane Boucheron and 
Pascal Massart for many interesting and helpful discussions. 

References 

1. Barron A.R. Logically smooth density estimation. Technical Report 56, Dept. of
Statistics, Stanford Univ. (1985)

2. Bartlett P., Boucheron S. and Lugosi G. Model selection and error estimation. 
Mach. Learn. 48 (2002) 85-113 

3. Bartlett P., Bousquet O. and Mendelson S. Localized Rademacher complexities.
Proc. of the 15th annual conf. on Computational Learning Theory (2002) 44-58







4. Birge L., Massart P. Minimum contrast estimators on sieves: exponential bounds 
and rates of convergence. Bernoulli 4 (1998) 329-375 

5. Boucheron S., Lugosi G., Massart P. A sharp concentration inequality with
applications. Random Struct. Algorithms 16 (2000) 277-292

6. Buescher K.L, Kumar P.R. Learning by canonical smooth estimation. I: Simultane- 
ous estimation, II: Learning and choice of model complexity. IEEE Trans. Autom. 
Control 41 (1996) 545-556, 557-569 

7. Devroye L., Lugosi G. Lower bounds in pattern recognition and learning. Pattern 
Recognition 28 (1995) 1011-1018 

8. Efron B. The jackknife, the bootstrap and other resampling plans. CBMS-NSF Reg.
Conf. Ser. Appl. Math. 38 (1982)

9. Fromont M. Quelques problemes de selection de modeles : construction de tests 
adaptatifs, ajustement de penalites par des methodes de bootstrap (Some model 
selection problems: construction of adaptive tests, bootstrap penalization). Ph. D. 
thesis, Universite Paris XI (2003) 

10. Gine E., Zinn J. Bootstrapping general empirical measures. Ann. Probab. 18 (1990) 
851-869 

11. Haussler D. Sphere packing numbers for subsets of the Boolean n-cube with
bounded Vapnik-Chervonenkis dimension. J. Comb. Theory A 69 (1995) 217-232

12. Koltchinskii V. Rademacher penalties and structural risk minimization. IEEE
Trans. Inf. Theory 47 (2001) 1902-1914

13. Koltchinskii V., Panchenko D. Rademacher processes and bounding the risk of
function learning. High dimensional probability II. 2nd international conference,
Univ. of Washington, DC, USA (1999)

14. Lozano F. Model selection using Rademacher penalization. Proceedings of the 2nd 
ICSC Symp. on Neural Computation. Berlin, Germany (2000) 

15. Lugosi G., Nobel A.B. Adaptive model selection using empirical complexities. Ann. 
Statist. 27 (1999) 1830-1864 

16. Lugosi G., Wegkamp M. Complexity regularization via localized random penalties.
Preprint (2003)

17. Lugosi G., Zeger K. Concept learning using complexity regularization. IEEE Trans.
Inf. Theory 42 (1996) 48-54

18. Mammen E., Tsybakov A. Smooth discrimination analysis. Ann. Statist. 27 (1999) 
1808-1829 

19. Massart P. Some applications of concentration inequalities to statistics. Ann. Fac. 
Sci. Toulouse 9 (2000) 245-303 

20. Massart P. Concentration inequalities and model selection. Lectures given at the St- 
Flour summer school of Probability Theory. To appear in Lect. Notes Math. (2003) 

21. Massart P., Nedelec E. Risk bounds for statistical learning. Preprint (2003) 

22. McDiarmid C. On the method of bounded differences. Surveys in combinatorics
(Lond. Math. Soc. Lect. Notes) 141 (1989) 148-188

23. Praestgaard J., Wellner J.A. Exchangeably weighted bootstraps of the general em- 
pirical process. Ann. Probab. 21 (1993) 2053-2086 

24. Tsybakov A. Optimal aggregation of classifiers in statistical learning. Preprint
(2001)

25. Vapnik V.N., Chervonenkis A.Ya. On the uniform convergence of relative frequen- 
cies of events to their probabilities. Theor. Probab. Appl. 16 (1971) 264-280 

26. Vapnik V. N., Chervonenkis A. Ya. Teoriya raspoznavaniya obrazov. Statisticheskie 
problemy obucheniya. Nauka, Moscow (1974) 

27. Vapnik V.N. Estimation of dependences based on empirical data. New York,
Springer-Verlag (1982)




Convergence of Discrete MDL for Sequential 

Prediction 



Jan Poland and Marcus Hutter 

IDSIA, Galleria 2, CH-6928 Manno (Lugano), Switzerland* 
{jan,marcus}@idsia.ch



Abstract. We study the properties of the Minimum Description Length 
principle for sequence prediction, considering a two-part MDL estimator 
which is chosen from a countable class of models. This applies in par- 
ticular to the important case of universal sequence prediction, where the 
model class corresponds to all algorithms for some fixed universal Turing 
machine (this correspondence is by enumerable semimeasures, hence the 
resulting models are stochastic). We prove convergence theorems similar 
to Solomonoff's theorem of universal induction, which also holds for general
Bayes mixtures. The bound characterizing the convergence speed for
MDL predictions is exponentially larger as compared to Bayes mixtures. 
We observe that there are at least three different ways of using MDL 
for prediction. One of these has worse prediction properties, for which 
predictions only converge if the MDL estimator stabilizes. We establish 
sufficient conditions for this to occur. Finally, some immediate conse- 
quences for complexity relations and randomness criteria are proven. 



1 Introduction 

The Minimum Description Length (MDL) principle is one of the most important 
concepts in Machine Learning, and serves as a scientific guide, in general. In 
particular, the process of building a model for any kind of given data is governed 
by the MDL principle in the majority of cases. The following illustrating example 
is probably familiar to many readers: A Bayesian net (or neural network) is 
constructed from (trained with) some data. We may just determine (train) the 
net in order to fit the data as closely as possible, then we are describing the 
data very precisely, but disregard the description of the net itself. The resulting 
net is a maximum likelihood estimator. Alternatively, we may simultaneously 
minimize the “residual” description length of the data given the net and the 
description length of the net. This corresponds to minimizing a regularized error 
term, and the result is a maximum a posteriori or MDL estimator. The latter 
way of modelling is not only superior to the former in most applications, it is 
also conceptually appealing since it implements the simplicity principle, Occam’s 
razor. 

The MDL method has been studied on all possible levels from very concrete
and highly tuned practical applications up to general theoretical assertions (see
e.g. [1,2,3]). The aim of this work is to contribute to the theory of MDL. We
regard Bayesian or neural nets or other models as just some particular class
of models. We identify (probabilistic) models with (semi)measures, data with
the initial part of a sequence $x_1, x_2, \ldots, x_{t-1}$, and the task of learning with
the problem of predicting the next symbol $x_t$ (or more symbols). The sequence
$x_1, x_2, \ldots$ itself is generated by some true but unknown distribution $\mu$.

* This work was supported by SNF grant 2100-67712.02.

A two-part MDL estimator for some string $x = x_1, \ldots, x_{t-1}$ is then some
short description of the semimeasure, while simultaneously the probability of
the data under the related semimeasure is large. Surprisingly little work has
been done on this general setting of sequence prediction with MDL. In contrast,
most work addresses MDL for coding and modeling, or others, see e.g. [4,5,6,7].
Moreover, there are some results for the prediction of independently identically
distributed (i.i.d.) sequences, see e.g. [6]. There, discrete model classes are
considered, while most of the material available focusses on continuous model
classes. In our work we will study countable classes of arbitrary semimeasures.

There is a strong motivation for considering both countable classes and 
semimeasures: In order to derive performance guarantees one has to assume 
that the model class contains the true model. So the larger we choose this class, 
the less restrictive is this assumption. From a computational point of view the 
largest relevant class is the class of all lower-semicomputable semimeasures. We 
call this setup universal sequence prediction. This class is at the foundations of 
and has been intensely studied in Algorithmic Information Theory [8,9,10]. Since 
algorithms do not necessarily halt on each string, one is forced to consider the 
more general class of semimeasures, rather than measures. Solomonoff [11,12] 
defined a universal induction system, essentially based on a Bayes mixture over 
this class (see [13,14] for recent developments). There seems to be no work on 
MDL for this class, which this paper intends to change. What has been studied 
intensely in [15] is the so called one-part MDL over the class of deterministic 
computable models (see also Section 7). 

The paper is structured as follows. Section 2 establishes basic definitions. 
In Section 3, we introduce the MDL estimator and show how it can be used 
for sequence prediction in at least three ways. Sections 4 and 5 are devoted 
to convergence theorems. In Section 6, we study the stabilization properties of 
the MDL estimator. The setting of universal sequence prediction is treated in 
Section 7. Finally, Section 8 contains the conclusions. 

2 Prerequisites and Notation 

We build on the notation of [9] and [15]. Let the alphabet $\mathcal{X}$ be a finite set
of symbols. We consider the spaces $\mathcal{X}^*$ and $\mathcal{X}^\infty$ of finite strings and infinite
sequences over $\mathcal{X}$. The initial part of a sequence up to a time $t \in \mathbb{N}$ or $t - 1 \in \mathbb{N}$
is denoted by $x_{1:t}$ or $x_{<t}$, respectively. The empty string is denoted by $\epsilon$.

A semimeasure is a function $\nu : \mathcal{X}^* \to [0, 1]$ such that

$$\nu(\epsilon) \leq 1 \quad\text{and}\quad \nu(x) \geq \sum_{a \in \mathcal{X}} \nu(xa) \quad\text{for all } x \in \mathcal{X}^* \qquad (1)$$

holds. If equality holds in both inequalities of (1), then we have a measure. Let $\mathcal{C}$
be a countable class of (semi)measures, i.e. $\mathcal{C} = \{\nu_i : i \in I\}$ with finite or infinite
index set $I \subseteq \mathbb{N}$. A (semi)measure $\tilde{\nu}$ dominates the class $\mathcal{C}$ iff for all $\nu_i \in \mathcal{C}$ there
is a constant $c(\nu_i) > 0$ such that $\tilde{\nu}(x) \geq c(\nu_i) \cdot \nu_i(x)$ holds for all $x \in \mathcal{X}^*$. The
dominant semimeasure $\tilde{\nu}$ need not be contained in $\mathcal{C}$, but if it is, we call it a
universal element of $\mathcal{C}$.

Let $\mathcal{C}$ be a countable class of (semi)measures, where each $\nu \in \mathcal{C}$ is associated
with a weight $w_\nu > 0$ and $\sum_\nu w_\nu \leq 1$. We may interpret the weights as a prior
on $\mathcal{C}$. Then it is obvious that the Bayes mixture

$$\xi(x) = \xi_{[\mathcal{C}]}(x) = \sum_{\nu \in \mathcal{C}} w_\nu \nu(x), \quad x \in \mathcal{X}^*, \qquad (2)$$

dominates $\mathcal{C}$. Assume that there is some measure $\mu \in \mathcal{C}$, the true distribution,
generating sequences $x_{<\infty} \in \mathcal{X}^\infty$. Normally $\mu$ is unknown. (Note that we require
$\mu$ to be a measure, while $\mathcal{C}$ may contain also semimeasures in general. This is
motivated by the setting of universal sequence prediction as already indicated.)
If some initial part $x_{<t}$ of a sequence is given, the probability of observing $x_t \in \mathcal{X}$
as a next symbol is given by

$$\mu(x_t | x_{<t}) = \frac{\mu(x_{<t} x_t)}{\mu(x_{<t})} \ \text{ if } \mu(x_{<t}) > 0 \quad\text{and}\quad \mu(x_t | x_{<t}) = 0 \ \text{ if } \mu(x_{<t}) = 0. \qquad (3)$$

The case $\mu(x_{<t}) = 0$ is stated only for well-definedness; it has probability zero.
Note that $\mu(x_t | x_{<t})$ can depend on $x_{<t}$. We may generally define the quantity
(3) for any function $\varphi : \mathcal{X}^* \to [0, 1]$; we call $\varphi(x_t | x_{<t}) = \frac{\varphi(x_{1:t})}{\varphi(x_{<t})}$ the $\varphi$-prediction.
Clearly, this is not necessarily a probability on $\mathcal{X}$ for general $\varphi$. For a
semimeasure $\nu$ in particular, the $\nu$-prediction $\nu(\cdot | x_{<t})$ is a semimeasure on $\mathcal{X}$.

We define the expectation with respect to the true probability $\mu$: Let $n \geq 0$
and $f : \mathcal{X}^n \to \mathbb{R}$ be a function, then

$$\mathbf{E} f = \mathbf{E} f(x_{1:n}) = \sum_{x_{1:n} \in \mathcal{X}^n} \mu(x_{1:n}) f(x_{1:n}). \qquad (4)$$

Generally, we may also define the expectation as an integral over infinite se- 
quences. But since we won’t need it, we can keep things simple. We can now 
state a central result about prediction with Bayes mixtures in a form indepen- 
dent of Algorithmic Information Theory. 

Theorem 1. For any class of (semi)measures $\mathcal{C}$ containing the true distribution
$\mu$ and any $n \geq 1$, we have

$$\sum_{t=1}^{n} \mathbf{E} \sum_{a \in \mathcal{X}} \big(\mu(a|x_{<t}) - \xi(a|x_{<t})\big)^2 \leq \ln w_\mu^{-1}. \qquad (5)$$

This was found by Solomonoff ([12]) for universal sequence prediction. A
proof is also given in [9] (only for binary alphabet) or [16] (arbitrary alphabet).
It is surprisingly simple once Lemma 7 is known. A few lines analogous to (8)
and (9) exploiting the dominance of $\xi$ are sufficient.
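The bound of Theorem 1 can be checked numerically on a toy class. The following is a minimal sketch, not part of the original paper: the three Bernoulli models, their weights, and the horizon n = 8 are hypothetical choices, and the expectation is computed by enumerating all binary histories.

```python
import math
from itertools import product

def bernoulli(theta):
    """i.i.d. measure on binary tuples with P(symbol = 1) = theta."""
    return lambda x: math.prod(theta if c == 1 else 1 - theta for c in x)

def expected_square_loss(mu, pred, n):
    """sum_{t=1}^n E sum_a (mu(a|x<t) - pred(a|x<t))^2, enumerating all x<t."""
    total = 0.0
    for t in range(1, n + 1):
        for x in product((0, 1), repeat=t - 1):
            if mu(x) == 0:
                continue  # histories of probability zero do not contribute
            for a in (0, 1):
                diff = mu(x + (a,)) / mu(x) - pred(x + (a,)) / pred(x)
                total += mu(x) * diff ** 2
    return total

# Hypothetical class and prior (assumptions for illustration only).
thetas = [0.5, 0.2, 0.8]
weights = [0.5, 0.25, 0.25]
models = [bernoulli(th) for th in thetas]

def xi(x):
    """Bayes mixture over the class."""
    return sum(w * nu(x) for w, nu in zip(weights, models))
```

Whichever model of the class plays the role of the true distribution, the accumulated loss of the mixture should stay below the logarithm of the inverse weight of that model.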

The bound (5) asserts convergence of the $\xi$-predictions to the $\mu$-predictions
in mean sum (i.m.s.), since we define

$$\varphi \xrightarrow{\text{i.m.s.}} \mu \iff \exists\, C > 0 : \sum_{t=1}^{\infty} \mathbf{E} \sum_{a \in \mathcal{X}} \big(\mu(a|x_{<t}) - \varphi(a|x_{<t})\big)^2 \leq C. \qquad (6)$$

Convergence i.m.s. implies convergence with $\mu$-probability one (w.$\mu$-p.1), since
otherwise the sum would be infinite. Moreover, convergence i.m.s. provides a rate
or speed of convergence in the sense that the expected number of times $t$ in which
$\varphi(a|x_{<t})$ deviates more than $\varepsilon$ from $\mu(a|x_{<t})$ is finite and bounded by $C/\varepsilon^2$, and
the probability that the number of $\varepsilon$-deviations exceeds $\frac{C}{\varepsilon^2 \delta}$ is smaller than $\delta$.
If the quadratic differences were monotonically decreasing (which is usually not
the case), we could even conclude convergence faster than $\frac{1}{t}$.

Probabilities vs. Description Lengths. By the Kraft inequality, each
(semi)measure can be associated with a code length or complexity by means
of the negative logarithm, where all (binary) codewords form a prefix-free set.
The converse holds as well. E.g. for the weights $w_\nu$ with $\sum_\nu w_\nu \leq 1$, codes of
lengths $\lceil -\log_2 w_\nu \rceil$ can be found. It is often only a matter of notational
convenience if description lengths or probabilities are used, but description lengths
are generally preferred in Algorithmic Information Theory. Keeping the
equivalence in mind, we will develop the general theory in terms of probabilities, but
formulate parts of the results in universal sequence prediction rather in terms of
complexities.
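As a small illustration of the correspondence between weights and code lengths, the following sketch computes the lengths $\lceil -\log_2 w \rceil$ for a prior and evaluates the Kraft sum. The prior values are hypothetical and chosen only for the example.

```python
import math

def code_lengths(weights):
    """Code lengths ceil(-log2 w) for positive weights summing to at most 1."""
    return [math.ceil(-math.log2(w)) for w in weights]

def kraft_sum(lengths):
    """Kraft sum; a prefix-free binary code with these lengths exists iff it is <= 1."""
    return sum(2.0 ** -length for length in lengths)

weights = [0.4, 0.3, 0.2, 0.05]   # hypothetical prior, total mass 0.95
lengths = code_lengths(weights)
```

Since $2^{-\lceil -\log_2 w \rceil} \leq w$ for every weight, the Kraft sum never exceeds the total prior mass.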

3 MDL Estimator and Predictions 

Assume that $\mathcal{C}$ is a countable class of semimeasures together with weights
$(w_\nu)_{\nu \in \mathcal{C}}$, and $x \in \mathcal{X}^*$ is some string. Then the maximizing element $\nu^x$, often
called MAP estimator, is defined as

$$\nu^x = \nu^x_{[\mathcal{C}]} = \arg\max_{\nu \in \mathcal{C}} \{w_\nu \nu(x)\}.$$

In fact the maximum is attained since for each $\varepsilon \in (0, 1)$ only a finite number
of elements fulfil $w_\nu \nu(x) > \varepsilon$. Observe immediately the correspondence in terms
of description lengths rather than probabilities: $\nu^x = \arg\min_{\nu \in \mathcal{C}} \{-\log_2 w_\nu - \log_2 \nu(x)\}$.
Then the minimum description length principle is obvious: $\nu^x$ minimizes
the joint description length of the model plus the data given the model^1
(see the last paragraph of the previous section). As explained before, we stick to
the product notation.

^1 Precisely, we define a MAP (maximum a posteriori) estimator. For two reasons,
information theorists and statisticians would not consider our definition as MDL in
the strong sense. First, MDL is often associated with a specific prior. Second, when
coding some data $x$, one can exploit the fact that once the model $\nu^x$ is specified,
only data which leads to the maximizing element $\nu^x$ needs to be considered. This
allows for a description shorter than $\log_2 \nu^x(x)$. Since however most authors refer to
MDL, we will keep using this general term instead of MAP, too.

For notational simplicity we set $\nu^*(x) = \nu^x(x)$. The two-part MDL estimator
is defined by

$$\varrho(x) = \varrho_{[\mathcal{C}]}(x) = w_{\nu^x} \nu^x(x) = \max_{\nu \in \mathcal{C}} \{w_\nu \nu(x)\}.$$

So $\varrho$ chooses the maximizing element with respect to its argument. We may
also use the version $\varrho^y(x) := w_{\nu^y} \nu^y(x)$ for which the choice depends on the
superscript instead of the argument. For each $x, y \in \mathcal{X}^*$, $\xi(x) \geq \varrho(x) \geq \varrho^y(x)$ is
immediate.
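For a finite class, the two-part MDL estimator can be computed by direct enumeration. The sketch below is an illustration under stated assumptions: the class of three Bernoulli measures and its weights are hypothetical, ties are broken by taking the first maximizer, and exact rational arithmetic is used so that the chain of inequalities above can be checked without rounding.

```python
from fractions import Fraction

def bernoulli(theta):
    """i.i.d. measure on binary strings with P('1') = theta."""
    def nu(x):
        p = Fraction(1)
        for c in x:
            p *= theta if c == "1" else 1 - theta
        return p
    return nu

# Hypothetical finite class of (weight, measure) pairs.
CLASS = [(Fraction(1, 2), bernoulli(Fraction(1, 2))),
         (Fraction(1, 4), bernoulli(Fraction(1, 4))),
         (Fraction(1, 4), bernoulli(Fraction(3, 4)))]

def xi(x):
    """Bayes mixture."""
    return sum(w * nu(x) for w, nu in CLASS)

def map_index(x):
    """Index of the maximizing element for x (first maximizer on ties)."""
    return max(range(len(CLASS)), key=lambda i: CLASS[i][0] * CLASS[i][1](x))

def rho(x, y=None):
    """Two-part MDL estimator; with y given, the maximizer is chosen for y instead of x."""
    w, nu = CLASS[map_index(x if y is None else y)]
    return w * nu(x)
```

For example, `rho("1111")` selects the model with parameter 3/4, while `rho("1111", "")` evaluates the same string under the model that is best for the empty string.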

We can define MDL predictors according to (3). There are at least three 
possible ways to use MDL for prediction. 



Definition 2. The dynamic MDL predictor is defined as

$$\varrho(a|x) = \frac{\varrho(xa)}{\varrho(x)} = \frac{\varrho^{xa}(xa)}{\varrho^{x}(x)}.$$

That is, we look for a short description of $xa$ and relate it to a short description
of $x$. We call this dynamic since for each possible $a$ we have to find a new
MDL estimator. This is the closest correspondence to the $\xi$-predictor.

Definition 3. The static MDL predictor is given by

$$\varrho^{\text{static}}(a|x) = \varrho^{x}(a|x) = \frac{\varrho^{x}(xa)}{\varrho(x)} = \frac{\varrho^{x}(xa)}{\varrho^{x}(x)} = \frac{\nu^{x}(xa)}{\nu^{x}(x)}.$$

Here obviously only one MDL estimator $\varrho^x$ has to be identified, which may be
more efficient in practice.

Definition 4. The hybrid MDL predictor is given by $\varrho^{\text{hybrid}}(a|x) = \frac{\nu^{xa}(xa)}{\nu^{x}(x)}$. This
can be paraphrased as "do dynamic MDL and drop the weights". It is somewhat
in-between static and dynamic MDL.

The range of the static MDL predictor is obviously contained in [0,1]. For
the dynamic MDL predictor, this holds by $\varrho^{x}(x) \geq \varrho^{xa}(x) \geq \varrho^{xa}(xa)$, while for
the hybrid MDL predictor it is generally false.
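The three predictors can be compared on a small finite class. The following sketch is only an illustration: the Bernoulli class, the weights, and the first-maximizer tie-breaking are assumptions made for the example. It shows the dynamic and static predictions staying in [0,1] while a hybrid prediction exceeds 1.

```python
from fractions import Fraction

def bernoulli(theta):
    """i.i.d. measure on binary strings with P('1') = theta."""
    def nu(x):
        p = Fraction(1)
        for c in x:
            p *= theta if c == "1" else 1 - theta
        return p
    return nu

ALPHABET = "01"
# Hypothetical class of (weight, measure) pairs.
CLASS = [(Fraction(1, 2), bernoulli(Fraction(1, 2))),
         (Fraction(1, 4), bernoulli(Fraction(1, 10))),
         (Fraction(1, 4), bernoulli(Fraction(9, 10)))]

def best(x):
    """(weight, measure) maximizing w * nu(x); the first maximizer wins ties."""
    return max(CLASS, key=lambda wn: wn[0] * wn[1](x))

def dynamic(a, x):
    """Ratio of two-part estimators: a new maximizer is found for x + a."""
    wa, na = best(x + a)
    wx, nx = best(x)
    return (wa * na(x + a)) / (wx * nx(x))

def static(a, x):
    """Prediction of the single model that is best for x."""
    _, nx = best(x)
    return nx(x + a) / nx(x)

def hybrid(a, x):
    """Dynamic MDL with the weights dropped."""
    _, na = best(x + a)
    _, nx = best(x)
    return na(x + a) / nx(x)

def normalized(pred, a, x):
    """Normalize a predictor so that it sums to one over the alphabet."""
    return pred(a, x) / sum(pred(b, x) for b in ALPHABET)
```

With these hypothetical numbers, the best model for "1" is the fair coin and the best model for "11" is the coin with parameter 9/10, so `hybrid("1", "1")` equals (81/100)/(1/2) = 81/50, which is larger than 1.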

Static MDL is omnipresent in machine learning and applications. In fact, 
many common prediction algorithms can be abstractly understood as static 
MDL, or rather as approximations. Namely, if a prediction task is accomplished 
by building a model such as a neural network with a suitable regularization to 
prevent “overfitting”, this is just searching an MDL estimator within a certain 
class of distributions. After that, only this model is used for prediction. Dynamic 
and hybrid MDL are applied more rarely due to their larger computational ef- 
fort. For example, the similarity metric proposed in [17] can be interpreted as (a 
deterministic variant of) dynamic MDL. For hybrid MDL, we will see that the 
prediction properties are worse than for dynamic and static MDL. 
We will need to convert our MDL predictors to measures on $\mathcal{X}$ by means of
normalization. If $\varphi : \mathcal{X}^* \to [0, 1]$ is any function, then

$$\varphi_{\text{norm}}(a|x_{<t}) = \frac{\varphi(a|x_{<t})}{\sum_{a' \in \mathcal{X}} \varphi(a'|x_{<t})} = \frac{\varphi(x_{<t}a)}{\sum_{a' \in \mathcal{X}} \varphi(x_{<t}a')}$$

(assume that the denominator is different from zero, which is always true with
probability 1 if $\varphi$ is an MDL predictor). This procedure is known as Solomonoff
normalization ([12,9]) and results in $\varphi_{\text{norm}}(x_{1:n}) = \varphi(x_{1:n}) / [\varphi(\epsilon) N_\varphi(x_{<n})]$,
where

$$N_\varphi(x) = \prod_{t=1}^{\ell(x)+1} \frac{\sum_{a \in \mathcal{X}} \varphi(x_{<t}a)}{\varphi(x_{<t})} \qquad (7)$$

is the normalizer. Before proceeding with the theory, an example is in order.



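Example 5. Let $n \in \mathbb{N}$, $\mathcal{X} = \{1, \ldots, n\}$, and

$$\mathcal{C} = \Big\{ \nu_\vartheta(x_{1:t}) = \prod_{s=1}^{t} \vartheta_{x_s} \ \text{ with } \vartheta \in \Theta \Big\}, \qquad \Theta = \Big\{ \vartheta \in ([0, 1] \cap \mathbb{Q})^n : \sum_{i=1}^{n} \vartheta_i = 1 \Big\}$$

be the set of all rational probability vectors with any prior $(w_\vartheta)_{\vartheta \in \Theta}$. Each $\vartheta \in \Theta$
generates sequences $x_{<\infty}$ of independently identically distributed (i.i.d.) random
variables such that $P(x_t = i) = \vartheta_i$ for all $t \geq 1$ and $1 \leq i \leq n$. If $x_{1:t}$ is the
initial part of a sequence and $\alpha \in \Theta$ is defined by $\alpha_i = \frac{1}{t} |\{s \leq t : x_s = i\}|$, then
it is easy to see that

$$\nu^{x_{1:t}} = \arg\max_{\vartheta \in \Theta} \big\{ w_\vartheta \cdot \exp[-t \cdot D(\alpha \| \vartheta)] \big\},$$

where $D(\alpha \| \vartheta) = \sum_{i=1}^{n} \alpha_i \ln \frac{\alpha_i}{\vartheta_i}$ is the Kullback-Leibler divergence. If $|\mathcal{X}| = 2$,
then $\Theta$ is also called a Bernoulli class, and one usually takes the binary alphabet
$\mathcal{X} = \mathbb{B} = \{0, 1\}$ in this case.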
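The identity in Example 5 holds because the likelihood of an i.i.d. model equals $\exp[-t(D(\alpha\|\vartheta) + H(\alpha))]$, where the entropy $H(\alpha)$ of the empirical frequencies does not depend on $\vartheta$. The sketch below compares the two ways of selecting the maximizing element on a finite sub-class. It is an illustration only: the function names are hypothetical, symbols are numbered from 0 rather than 1, and all parameter vectors are assumed to have strictly positive entries.

```python
import math
from collections import Counter

def kl(alpha, theta):
    """Kullback-Leibler divergence D(alpha || theta); terms with alpha_i = 0 vanish."""
    return sum(a * math.log(a / t) for a, t in zip(alpha, theta) if a > 0)

def map_by_likelihood(x, thetas, weights):
    """Index maximizing w * prod_s theta[x_s] directly."""
    scores = [w * math.prod(th[s] for s in x) for th, w in zip(thetas, weights)]
    return max(range(len(thetas)), key=scores.__getitem__)

def map_by_kl(x, thetas, weights):
    """Index maximizing w * exp(-t * D(alpha || theta)), alpha = empirical frequencies."""
    t = len(x)
    counts = Counter(x)
    alpha = [counts[i] / t for i in range(len(thetas[0]))]
    scores = [w * math.exp(-t * kl(alpha, th)) for th, w in zip(thetas, weights)]
    return max(range(len(thetas)), key=scores.__getitem__)
```

Both functions should select the same model whenever the maximizer is unique, since their scores differ only by the common positive factor $\exp[t H(\alpha)]$.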



4 Dynamic MDL 

We can start to develop results. It is surprisingly easy to give a convergence proof
w.p.1 of the non-normalized dynamic MDL predictions based on martingales.
However we omit it, since it does not include a convergence speed assertion as
i.m.s. results do, nor does it yield an off-sequence statement about $\varrho(a|x_{<t})$ for
$a \neq x_t$ which is necessary for prediction.

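Lemma 6. For an arbitrary class of (semi)measures $\mathcal{C}$, we have

(i) $\varrho(x) - \sum_{a \in \mathcal{X}} \varrho(xa) \leq \xi(x) - \sum_{a \in \mathcal{X}} \xi(xa)$ and

(ii) $\varrho(x) - \sum_{a \in \mathcal{X}} \varrho^{x}(xa) \leq \xi(x) - \sum_{a \in \mathcal{X}} \xi(xa)$

for all $x \in \mathcal{X}^*$. In particular, $\xi - \varrho$ is a semimeasure.

Proof. For all $x \in \mathcal{X}^*$, with $f := \xi - \varrho$ we have

$$\sum_{a \in \mathcal{X}} f(xa) = \sum_{a \in \mathcal{X}} \big(\xi(xa) - \varrho(xa)\big) \leq \sum_{a \in \mathcal{X}} \big(\xi(xa) - \varrho^{x}(xa)\big)$$
$$= \sum_{a \in \mathcal{X}} \sum_{\nu \in \mathcal{C} \setminus \{\nu^x\}} w_\nu \nu(xa) \leq \sum_{\nu \in \mathcal{C} \setminus \{\nu^x\}} w_\nu \nu(x) = \xi(x) - \varrho(x) = f(x).$$

The first inequality follows from $\varrho^{x}(xa) \leq \varrho(xa)$, and the second one holds since
all $\nu$ are semimeasures. Finally, $f(x) = \xi(x) - \varrho(x) = \sum_{\nu \in \mathcal{C} \setminus \{\nu^x\}} w_\nu \nu(x) \geq 0$
and $f(\epsilon) = \xi(\epsilon) - \varrho(\epsilon) \leq 1$. Hence $f$ is a semimeasure. □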



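Lemma 7. Let $\mu$ and $\tilde{\mu}$ be measures on $\mathcal{X}$, then

$$\sum_{a \in \mathcal{X}} \big(\mu(a) - \tilde{\mu}(a)\big)^2 \leq \sum_{a \in \mathcal{X}} \mu(a) \ln \frac{\mu(a)}{\tilde{\mu}(a)}.$$

See e.g. [16, Sec. 3.2] for a proof.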

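Theorem 8. For any class of (semi)measures $\mathcal{C}$ containing the true distribution
$\mu$ and for all $n \in \mathbb{N}$, we have

$$\sum_{t=1}^{n} \mathbf{E} \sum_{a \in \mathcal{X}} \big(\mu(a|x_{<t}) - \varrho_{\text{norm}}(a|x_{<t})\big)^2 \leq w_\mu^{-1} + \ln w_\mu^{-1}.$$

That is, $\varrho_{\text{norm}}(a|x_{<t}) \xrightarrow{\text{i.m.s.}} \mu(a|x_{<t})$ (see (6)), which implies
$\varrho_{\text{norm}}(a|x_{<t}) \to \mu(a|x_{<t})$ with $\mu$-probability one.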

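Proof. From Lemma 7, we know

$$\sum_{t=1}^{n} \mathbf{E} \sum_{a \in \mathcal{X}} \big(\mu(a|x_{<t}) - \varrho_{\text{norm}}(a|x_{<t})\big)^2 \leq \sum_{t=1}^{n} \mathbf{E} \sum_{a \in \mathcal{X}} \mu(a|x_{<t}) \ln \frac{\mu(a|x_{<t})}{\varrho_{\text{norm}}(a|x_{<t})}$$
$$= \sum_{t=1}^{n} \mathbf{E} \ln \frac{\mu(x_t|x_{<t})}{\varrho_{\text{norm}}(x_t|x_{<t})} = \sum_{t=1}^{n} \mathbf{E} \left[ \ln \frac{\mu(x_t|x_{<t})}{\varrho(x_t|x_{<t})} + \ln \frac{\sum_{a \in \mathcal{X}} \varrho(x_{<t}a)}{\varrho(x_{<t})} \right]. \qquad (8)$$

Then we can estimate

$$\sum_{t=1}^{n} \mathbf{E} \ln \frac{\mu(x_t|x_{<t})}{\varrho(x_t|x_{<t})} = \mathbf{E} \ln \prod_{t=1}^{n} \frac{\mu(x_t|x_{<t})}{\varrho(x_t|x_{<t})} = \mathbf{E} \ln \frac{\mu(x_{1:n})}{\varrho(x_{1:n})} \leq \ln w_\mu^{-1}, \qquad (9)$$

since always $\frac{\mu}{\varrho} \leq w_\mu^{-1}$. Moreover, by setting $x = x_{<t}$, using $\ln u \leq u - 1$, adding
an always positive max-term, and finally using $\frac{\mu}{\varrho} \leq w_\mu^{-1}$ again, we obtain

$$\mathbf{E} \ln \frac{\sum_{a} \varrho(x_{<t}a)}{\varrho(x_{<t})} \leq \mathbf{E} \left[ \frac{\sum_{a} \varrho(xa)}{\varrho(x)} - 1 \right] = \sum_{\ell(x)=t-1} \frac{\mu(x) \big[ \big(\sum_{a} \varrho(xa)\big) - \varrho(x) \big]}{\varrho(x)}$$
$$\leq \sum_{\ell(x)=t-1} \frac{\mu(x) \big[ \big(\sum_{a} \varrho(xa)\big) - \varrho(x) + \max\{0, \varrho(x) - \sum_{a} \varrho(xa)\} \big]}{\varrho(x)}$$
$$\leq w_\mu^{-1} \sum_{\ell(x)=t-1} \Big[ \Big(\sum_{a} \varrho(xa)\Big) - \varrho(x) + \max\Big\{0, \varrho(x) - \sum_{a} \varrho(xa)\Big\} \Big]. \qquad (10)$$

We proceed by observing

$$\sum_{t=1}^{n} \sum_{\ell(x)=t-1} \Big[ \Big(\sum_{a} \varrho(xa)\Big) - \varrho(x) \Big] = \sum_{t=1}^{n} \Big[ \sum_{\ell(x)=t} \varrho(x) - \sum_{\ell(x)=t-1} \varrho(x) \Big] = \sum_{\ell(x)=n} \varrho(x) - \varrho(\epsilon), \qquad (11)$$

which is true since for successive $t$ the positive and negative terms cancel. From
Lemma 6 we know $\varrho(x) - \sum_{a} \varrho(xa) \leq \xi(x) - \sum_{a} \xi(xa)$ and therefore

$$\sum_{t=1}^{n} \sum_{\ell(x)=t-1} \max\Big\{0, \varrho(x) - \sum_{a \in \mathcal{X}} \varrho(xa)\Big\} \leq \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \max\Big\{0, \xi(x) - \sum_{a \in \mathcal{X}} \xi(xa)\Big\}$$
$$= \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \Big[ \xi(x) - \sum_{a \in \mathcal{X}} \xi(xa) \Big] = \xi(\epsilon) - \sum_{\ell(x)=n} \xi(x). \qquad (12)$$

Here we have again used the fact that positive and negative terms cancel for
successive $t$, and moreover the fact that $\xi$ is a semimeasure. Combining (10),
(11) and (12), and observing $\varrho \leq \xi \leq 1$, we obtain

$$\sum_{t=1}^{n} \mathbf{E} \ln \frac{\sum_{a} \varrho(x_{<t}a)}{\varrho(x_{<t})} \leq w_\mu^{-1} \Big[ \xi(\epsilon) - \varrho(\epsilon) + \sum_{\ell(x)=n} \big(\varrho(x) - \xi(x)\big) \Big] \leq w_\mu^{-1} \xi(\epsilon) \leq w_\mu^{-1}. \qquad (13)$$

Therefore, (8), (9) and (13) finally prove the assertion. □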

This is the first convergence result in mean sum, see (6). It implies both
on-sequence and off-sequence convergence. Moreover, it asserts the convergence
is "fast" in the sense that the sum of the total expected deviations is bounded
by $w_\mu^{-1} + \ln w_\mu^{-1}$. Of course, $w_\mu^{-1}$ can be very large, namely 2 to the power of
complexity of $\mu$. The following example will show that this bound is sharp (save
for a constant factor). Observe that in the corresponding result for mixtures,
Theorem 1, the bound is much smaller, namely $\ln w_\mu^{-1}$, i.e. the complexity of $\mu$
(up to the factor $\ln 2$).

Example 9. Let $\mathcal{X} = \{0,1\}$, $N \geq 1$ and $\mathcal{C} = \{\nu_1, \ldots, \nu_{N-1}, \mu\}$. Each $\nu_i$ is a
deterministic measure concentrated on the sequence $1^{i-1}0^\infty$, while the true
distribution $\mu$ is deterministic and concentrated on $x_{<\infty} = 1^\infty$. Let $w_{\nu_i} = w_\mu = \frac{1}{N}$
for all $i$. Then $\mu$ generates $x_{<\infty}$, and for each $t \leq N - 1$ we have
$\varrho_{\text{norm}}(0|x_{<t}) = \varrho_{\text{norm}}(1|x_{<t}) = \frac{1}{2}$. Hence,
$\sum_t \mathbf{E} \sum_a \big(\mu(a|x_{<t}) - \varrho_{\text{norm}}(a|x_{<t})\big)^2 = \frac{1}{2}(N - 1) \approx \frac{1}{2} w_\mu^{-1}$
for large $N$. Here, $\mu$ is Bernoulli, while the $\nu_i$ are not. It might be
surprising at a first glance that there are even classes $\mathcal{C}$ containing only Bernoulli
distributions, where the exponential bound is sharp [18].
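The value $\frac{1}{2}(N-1)$ in Example 9 can be reproduced by direct computation. The following sketch is an illustration, with hypothetical function names. Since the true distribution is deterministic, the expectation reduces to the single history $1^{t-1}$, and the horizon is chosen as $2N$ (an arbitrary value beyond $N - 1$, after which the error is zero).

```python
import math
from fractions import Fraction

def nu(i):
    """Deterministic measure concentrated on 1^(i-1) 0^infinity."""
    def measure(x):
        return Fraction(int(all(c == ("1" if k < i - 1 else "0")
                                for k, c in enumerate(x))))
    return measure

def mu(x):
    """True distribution: deterministic, concentrated on 1^infinity."""
    return Fraction(int(all(c == "1" for c in x)))

def example9_sum(N):
    """Total expected squared error of normalized dynamic MDL up to time 2N."""
    models = [nu(i) for i in range(1, N)] + [mu]
    w = Fraction(1, N)                       # uniform weights 1/N
    rho = lambda x: max(w * m(x) for m in models)
    total = Fraction(0)
    for t in range(1, 2 * N + 1):
        x = "1" * (t - 1)                    # the only history with mu(x) > 0
        z = rho(x + "0") + rho(x + "1")      # normalizing sum
        for a in "01":
            # mu(a|x) = mu(xa) because mu(x) = 1 on this history
            total += (mu(x + a) - rho(x + a) / z) ** 2
    return total

def theorem8_bound(N):
    """Right-hand side of Theorem 8 for w_mu = 1/N."""
    return N + math.log(N)
```

For instance, `example9_sum(8)` gives 7/2, which is within a constant factor of the bound `theorem8_bound(8)`.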
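Theorem 10. For any class of (semi)measures $\mathcal{C}$ containing the true
distribution $\mu$, we have

(i) $\sum_{t=1}^{\infty} \mathbf{E} \Big| \ln \sum_{a \in \mathcal{X}} \varrho(a|x_{<t}) \Big| \leq 2 w_\mu^{-1}$ and

(ii) $\sum_{t=1}^{\infty} \mathbf{E} \sum_{a \in \mathcal{X}} \big| \varrho_{\text{norm}}(a|x_{<t}) - \varrho(a|x_{<t}) \big| \leq 2 w_\mu^{-1}$.

Consequently, $\varrho(a|x_{<t}) \xrightarrow{\text{i.m.s.}} \mu(a|x_{<t})$, and for almost all $x_{<\infty} \in \mathcal{X}^\infty$, the
normalizer $N_\varrho$ defined in (7) converges to a number which is finite and greater than
zero, i.e. $0 < N_\varrho(x_{<\infty}) < \infty$.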

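Proof. (i) Define $u^+ = \max\{0, u\}$ for $u \in \mathbb{R}$, then for $x := x_{<t} \in \mathcal{X}^{t-1}$ we have

$$\mathbf{E} \Big| \ln \sum_{a} \varrho(a|x) \Big| = \mathbf{E} \Big| \ln \frac{\sum_{a} \varrho(xa)}{\varrho(x)} \Big| = \mathbf{E} \left[ \Big( \ln \frac{\sum_{a} \varrho(xa)}{\varrho(x)} \Big)^+ + \Big( \ln \frac{\varrho(x)}{\sum_{a} \varrho(xa)} \Big)^+ \right]$$
$$\leq \mathbf{E} \left[ \frac{\big(\sum_{a} \varrho(xa) - \varrho(x)\big)^+}{\varrho(x)} + \frac{\big(\varrho(x) - \sum_{a} \varrho(xa)\big)^+}{\sum_{a} \varrho(xa)} \right]$$
$$= \sum_{\ell(x)=t-1} \left[ \frac{\mu(x) \big(\sum_{a} \varrho(xa) - \varrho(x)\big)^+}{\varrho(x)} + \frac{\mu(x) \big(\varrho(x) - \sum_{a} \varrho(xa)\big)^+}{\sum_{a} \varrho(xa)} \right]$$
$$\leq w_\mu^{-1} \sum_{\ell(x)=t-1} \Big(\sum_{a} \varrho(xa) - \varrho(x)\Big)^+ + w_\mu^{-1} \sum_{\ell(x)=t-1} \Big(\varrho(x) - \sum_{a} \varrho(xa)\Big)^+$$
$$= w_\mu^{-1} \sum_{\ell(x)=t-1} \Big[ \sum_{a} \varrho(xa) - \varrho(x) + 2 \Big(\varrho(x) - \sum_{a} \varrho(xa)\Big)^+ \Big].$$

Here, $|u| = u^+ + (-u)^+ = -u + 2u^+$, $\ln u \leq u - 1$, and $\varrho \geq w_\mu \mu$ have been used;
the latter implies also $\sum_{a} \varrho(xa) \geq w_\mu \sum_{a} \mu(xa) = w_\mu \mu(x)$. The last expression
in this (in)equality chain, when summed over $t = 1 \ldots \infty$, is bounded by $2 w_\mu^{-1}$ by
essentially the same arguments (10) - (13) as in the proof of Theorem 8.

(ii) Let again $x := x_{<t}$ and use $\varrho_{\text{norm}}(a|x) = \varrho(a|x) / \sum_{b} \varrho(b|x)$ to obtain

$$\sum_{a} \big| \varrho_{\text{norm}}(a|x) - \varrho(a|x) \big| = \sum_{a} \frac{\varrho(a|x)}{\sum_{b} \varrho(b|x)} \Big| 1 - \sum_{b} \varrho(b|x) \Big| = \Big| 1 - \sum_{b} \varrho(b|x) \Big|$$
$$= \frac{\big(\sum_{a} \varrho(xa) - \varrho(x)\big)^+}{\varrho(x)} + \frac{\big(\varrho(x) - \sum_{a} \varrho(xa)\big)^+}{\varrho(x)}.$$

Then take the expectation $\mathbf{E}$ and the sum $\sum_{t=1}^{\infty}$ and proceed as in (i). Finally,
$\varrho(a|x_{<t}) \xrightarrow{\text{i.m.s.}} \mu(a|x_{<t})$ follows by combining (ii) with Theorem 8, and by (i),
$\sum_{t=1}^{n} \big| \ln \frac{\sum_{a} \varrho(x_{<t}a)}{\varrho(x_{<t})} \big|$ is bounded in $n$ with $\mu$-probability 1, thus the same is true
for $\ln N_\varrho(x_{<\infty}) = \sum_{t=1}^{\infty} \ln \frac{\sum_{a} \varrho(x_{<t}a)}{\varrho(x_{<t})}$. □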
5 Static MDL



So far, we have considered dynamic MDL from Definition 2. We turn now to the 
static variant (Definition 3), which is usually more efficient and thus preferred 
in practice. 

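Theorem 11. For any class of (semi)measures $\mathcal{C}$ containing the true
distribution $\mu$, we have

$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{a \in \mathcal{X}} \big| \varrho^{x_{<t}}_{\text{norm}}(a|x_{<t}) - \varrho^{x_{<t}}(a|x_{<t}) \big| = \sum_{t=1}^{\infty} \mathbf{E} \Big| 1 - \sum_{a \in \mathcal{X}} \varrho^{x_{<t}}(a|x_{<t}) \Big| \leq w_\mu^{-1}.$$

Proof. We proceed in a similar way as in the proof of Theorem 8, (10) - (12).
From Lemma 6, we know $\varrho(x) - \sum_{a} \varrho^{x}(xa) \leq \xi(x) - \sum_{a} \xi(xa)$. Then

$$\sum_{t=1}^{n} \mathbf{E} \Big| 1 - \sum_{a \in \mathcal{X}} \varrho^{x_{<t}}(a|x_{<t}) \Big| = \sum_{t=1}^{n} \mathbf{E} \Big( 1 - \sum_{a \in \mathcal{X}} \varrho^{x_{<t}}(a|x_{<t}) \Big) = \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \frac{\mu(x) \big(\varrho(x) - \sum_{a \in \mathcal{X}} \varrho^{x}(xa)\big)}{\varrho(x)}$$
$$\leq w_\mu^{-1} \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \Big[ \xi(x) - \sum_{a \in \mathcal{X}} \xi(xa) \Big] = w_\mu^{-1} \Big[ \xi(\epsilon) - \sum_{\ell(x)=n} \xi(x) \Big] \leq w_\mu^{-1}$$

for all $n \in \mathbb{N}$. This implies the assertion. Again we have used $\frac{\mu}{\varrho} \leq w_\mu^{-1}$ and the
fact that positive and negative terms cancel for successive $t$. □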



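Corollary 12. Let $\mathcal{C}$ contain the true distribution $\mu$, then

$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{a} \big(\mu(a|x_{<t}) - \varrho_{\text{norm}}(a|x_{<t})\big)^2 \leq w_\mu^{-1} + \ln w_\mu^{-1},$$
$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{a} \big(\mu(a|x_{<t}) - \varrho(a|x_{<t})\big)^2 \leq 8 w_\mu^{-1},$$
$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{a} \big(\mu(a|x_{<t}) - \varrho^{x_{<t}}(a|x_{<t})\big)^2 \leq 21 w_\mu^{-1},$$
$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{a} \big(\mu(a|x_{<t}) - \varrho^{x_{<t}}_{\text{norm}}(a|x_{<t})\big)^2 \leq 32 w_\mu^{-1}.$$

Proof. This follows by combining the assertions of Theorems 8-11 with the
triangle inequality. For static MDL, use in addition
$\sum_{a} |\varrho(a|x) - \varrho^{x}(a|x)| = |\sum_{a} \varrho(a|x) - \sum_{a} \varrho^{x}(a|x)| \leq |\sum_{a} \varrho(a|x) - 1| + |1 - \sum_{a} \varrho^{x}(a|x)|$,
which follows from $\varrho(xa) \geq \varrho^{x}(xa)$. □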



This corollary recapitulates our results and states convergence i.m.s. (and
therefore also with $\mu$-probability 1) for all combinations of
un-normalized/normalized and dynamic/static MDL predictions.^2

^2 We briefly discuss the choice of the total expected square error for measuring speed
of convergence. The expected Kullback-Leibler distance may seem more natural in

We now turn to the hybrid MDL variant (see Definition 4). So far we have not 
cared about what happens if two or more (semi)measures obtain the same value 
Wvv{x) for some string x. In fact, for the previous results, the tie-breaking strategy 
can be completely arbitrary. This need not be so for all thinkable prediction 
methods other than static and dynamic MDL, as the following example shows. 

Example 13. Let $\mathcal{X} = \mathbb{B}$ and $\mathcal{C}$ contain only two measures, the uniform measure $\lambda$,
which is defined by $\lambda(x) = 2^{-\ell(x)}$, and another measure $\nu$ having $\nu(1x) = 2^{-\ell(x)}$
and $\nu(0x) = 0$. The respective weights are $w_\lambda = \frac{2}{3}$ and $w_\nu = \frac{1}{3}$. Then, for each
$x$ starting with 1, we have $w_\nu \nu(x) = w_\lambda \lambda(x) = \frac{1}{3} 2^{-\ell(x)+1}$. Therefore, for all
$x_{<\infty}$ starting with 1 (a set which has uniform measure $\frac{1}{2}$), we have a tie. If the
maximizing element $\nu^*$ is chosen to be $\lambda$ for even $t$ and $\nu$ for odd $t$, then both
static and dynamic MDL constantly predict probabilities of $\frac{1}{2}$ for all $a \in \mathbb{B}$.
However, the hybrid MDL predictor values oscillate between $\frac{1}{4}$ and 1.
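The tie in this example can be checked with exact arithmetic. The following is a minimal sketch, not the authors' code. It assumes the usual readings of Definitions 2-4, which lie outside this section: dynamic MDL predicts $\varrho(xa)/\varrho(x)$ with $\varrho(x) = \max_\nu w_\nu \nu(x)$, static MDL predicts $\nu^*(xa)/\nu^*(x)$ with $\nu^*$ the maximizing element at $x$, and hybrid MDL predicts the ratio of the (possibly different) maximizing elements at $xa$ and at $x$ with the weights dropped. The function names are hypothetical.

```python
from fractions import Fraction

def lam(x: str) -> Fraction:
    """Uniform measure on binary strings: lambda(x) = 2^(-len(x))."""
    return Fraction(1, 2 ** len(x))

def nu(x: str) -> Fraction:
    """Measure with nu(1x) = 2^(-len(x)) and nu(0x) = 0."""
    if x == "":
        return Fraction(1)
    if x[0] == "0":
        return Fraction(0)
    return Fraction(1, 2 ** (len(x) - 1))

W_LAM, W_NU = Fraction(2, 3), Fraction(1, 3)

def pick(t: int):
    """Adversarial tie-breaking from the example: lambda for even t, nu for odd t."""
    return (lam, W_LAM) if t % 2 == 0 else (nu, W_NU)

def predictions(x: str, a: str):
    """Return (dynamic, static, hybrid) predictions of symbol a after x (x starts with 1)."""
    t = len(x)
    m_now, w_now = pick(t)        # maximizing element selected at x
    m_next, w_next = pick(t + 1)  # maximizing element selected at xa
    dynamic = (w_next * m_next(x + a)) / (w_now * m_now(x))
    static = m_now(x + a) / m_now(x)
    hybrid = m_next(x + a) / m_now(x)  # assumed form: dynamic MDL with weights dropped
    return dynamic, static, hybrid

x = "1"
for _ in range(4):
    print(len(x), predictions(x, "1"))
    x += "1"
```

Under these assumptions the dynamic and static values stay at 1/2 while the hybrid value alternates, which is the oscillation the example describes.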

If the ambiguity in the tie-breaking process is removed, e.g. if the
measure with the larger weight $w_\nu$ is always chosen, then the hybrid MDL predictor
does converge for this example. If there are more (semi)measures in the class
and there still remains a tie of shortest programs, an arbitrary program can be
selected, since then the respective measures are equal, too. In the following, we
assume that this tie-breaking rule is applied.

Do the hybrid MDL predictions always converge then? This is equivalent to 
asking if the process of selecting a maximizing element eventually stabilizes. If 
there is no stabilization, then hybrid MDL will necessarily fail as soon as the 
weights are not equal. A possible counterexample could consist of two measures 
the fraction of which oscillates perpetually around a certain value. This can 
indeed happen. 

Example 14. Let $\mathcal{X}$ be binary, $\mu(x) = \prod_{i=1}^{\ell(x)} \mu_i(x_i)$ and $\nu(x) = \prod_{i=1}^{\ell(x)} \nu_i(x_i)$ with

/i,(l) = 1 - 2"^r5l and v,{l) = 1 - 2"^r^l+i. 

Then one can easily see that $\mu(111\ldots) = \prod_{i=1}^{\infty} \mu_i(1) > 0$, $\nu(111\ldots) = \prod_{i=1}^{\infty} \nu_i(1) > 0$,
and the quotient $\nu(1^n)/\mu(1^n)$ is convergent but oscillates around its limit. Therefore,
we can set $w_\mu$ and $w_\nu$ appropriately to prevent the maximizing element from
stabilizing on $x_{<\infty} = 111\ldots$ (Moreover, each sequence having positive measure
under $\mu$ and $\nu$ contains eventually only ones, and the quotient oscillates.)

the light of our proofs. However, this quantity behaves well only under dynamic
MDL, not static MDL. To see this, let $\mathcal{C}$ be the class of all computable Bernoulli
distributions and $\mu$ the measure having $\mu(0) = \mu(1) = \frac{1}{2}$. Then the sequence $x = 0^n$
has nonzero probability. For sufficiently large $n$, $\nu^* = \nu_0$ holds (typically already for
small $n$), where $\nu_0$ is the distribution generating only 0. Then $D(\mu\|\nu_0) = \infty$, and
the expectation is $\infty$, too. The quadratic distance behaves locally like the
Kullback-Leibler distance (Lemma 7), but otherwise is bounded and thus more convenient.

The reason for the oscillation in this example is the fact that the measures $\mu$
and $\nu$ are asymptotically very similar. One can also achieve a similar effect by
constructing a measure which is dependent on the past. This shows in particular
that we need both parts of the following definition which states properties
sufficient for a positive result.

Definition 15. (i) A (semi)measure $\nu$ on $\mathcal{X}^\infty$ is called factorizable if there are
(semi)measures $\nu_i$ on $\mathcal{X}$ such that $\nu(x) = \prod_{i=1}^{\ell(x)} \nu_i(x_i)$ for all $x \in \mathcal{X}^*$. That is,
the symbols of sequences $x_{<\infty}$ generated by $\nu$ are independent.

(ii) A factorizable (semi)measure $\mu = \prod \mu_i$ is called uniformly stochastic, if
there is some $\delta > 0$ such that at each time $i$ the probability of all symbols $a \in \mathcal{X}$
is either 0 or at least $\delta$. That is, $\mu_i(a) > 0 \Rightarrow \mu_i(a) \ge \delta$ for all $a \in \mathcal{X}$ and $i \ge 1$.
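As an illustration of part (ii), the condition can be tested on a finite prefix of the factors $\mu_i$. This is only a sketch: a finite check cannot establish the property for all $i \ge 1$, and the function name and the dictionary representation of each $\mu_i$ are assumptions made for this example.

```python
def is_uniformly_stochastic(factors, delta):
    """Check mu_i(a) > 0 => mu_i(a) >= delta on a finite list of factors.

    factors: list of dicts mapping each symbol a to mu_i(a), for i = 1..n.
    """
    return all(p == 0 or p >= delta
               for mu_i in factors
               for p in mu_i.values())

# Alternating fair coin flips and deterministic bits (here: leading binary
# digits of pi, 11.00100100...), as in the example following the definition.
pi_bits = [1, 1, 0, 0, 1, 0, 0, 1]
fair = {0: 0.5, 1: 0.5}
mixed = []
for b in pi_bits:
    mixed.append(fair)
    mixed.append({b: 1.0, 1 - b: 0.0})

# A factorizable measure whose smallest positive probability shrinks with i.
shrinking = [{1: 1 - 2.0 ** -i, 0: 2.0 ** -i} for i in range(1, 11)]

print(is_uniformly_stochastic(mixed, 0.5))      # delta = 1/2 works
print(is_uniformly_stochastic(shrinking, 0.01)) # fails once 2^-i < 0.01
```

The second family shows why a single $\delta$ is required for all times $i$: each factor has positive probabilities, but no fixed lower bound survives as $i$ grows.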

In particular, all deterministic measures are uniformly stochastic. Another
simple example of a uniformly stochastic measure is a probability distribution
which generates alternately random bits by fair coin flips and the digits of the
binary representation of $\pi$.

Theorem 16. Let $\mathcal{C}$ be a countable class of factorizable (semi)measures and $\mu$
be uniformly stochastic. Then the maximizing element stabilizes almost surely.

We omit the proof. So in particular, under the conditions of Theorem 16, 
the hybrid MDL predictions converge almost surely. No statement about the 
convergence speed can be made. 

7 Complexities and Randomness 

In this section, we concentrate on universal sequence prediction. It was mentioned
already in the introduction that this is one interesting application of the
theory developed so far. So $\mathcal{C} = \mathcal{M}$ is the countable set of all enumerable (i.e.
lower semicomputable) semimeasures on $\mathcal{X}^*$. (Algorithms are identified with
semimeasures rather than measures since they need not terminate.) $\mathcal{M}$ contains
stochastic models in general, and in particular all models for computable
deterministic sequences. One can show that this class $\mathcal{M}$ is determined by all
algorithms on some fixed universal monotone Turing machine $U$ [9, Th. 4.5.2].
By this correspondence, each semimeasure $\nu \in \mathcal{M}$ is assigned a canonical weight
$w_\nu = 2^{-K(\nu)}$ (where $K(\nu)$ is the Kolmogorov complexity of $\nu$, see [9, Eq. 4.11]),
and $\sum_\nu w_\nu \le 1$ holds. We will assume programs to be binary, i.e. $p \in \mathbb{B}^*$, in
contrast to outputs, which are strings $x \in \mathcal{X}^*$.

The MDL definitions in Section 3 directly transfer to this setup. All our
results (Theorems 8-11) therefore apply to $\varrho = \varrho_{[\mathcal{M}]}$ if the true distribution
$\mu$ is a measure, which is not very restrictive. Then $\mu$ is necessarily computable.
Also, Theorem 1 implies Solomonoff's important universal induction theorem: $\xi$
converges to the true distribution i.m.s., if the latter is computable. Note that
the Bayes mixture $\xi$ is within a multiplicative constant of the Solomonoff-Levin
prior $M(x)$, which is the algorithmic probability that $U$ produces an output
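starting with $x$ if its input is random.

In addition to $\mathcal{M}$, we also consider the set of all recursive measures $\tilde{\mathcal{M}}$ together
with the same canonical weights, and the mixture $\tilde\xi(x) = \sum_{\nu \in \tilde{\mathcal{M}}} w_\nu \nu(x)$.
Likewise, define $\tilde\varrho = \varrho_{[\tilde{\mathcal{M}}]}$. Then we obviously have $\tilde\varrho(x) \le \tilde\xi(x) \le \xi(x)$ and
$\varrho(x) \le \xi(x)$ for all $x \in \mathcal{X}^*$. It is even immediate that $\xi(x) \overset{\times}{\le} \varrho(x)$ since $\xi \in \mathcal{M}$.
Here, by $f \overset{\times}{\le} g$ we mean $f \le g \cdot O(1)$; $\overset{\times}{\ge}$ and $\overset{\times}{=}$ are defined analogously.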

Moreover, for any string $x \in \mathcal{X}^*$, there is also a universal one-part MDL
estimator $m(x) = 2^{-Km(x)}$ derived from the monotone complexity
$Km(x) = \min\{\ell(p) : U(p) = x*\}$. (I.e. the monotone complexity is the length of the
shortest program such that $U$'s output starts with $x$.) The minimal program
$p$ defines a measure $\nu$ with $\nu(x) = 1$ and $w_\nu \ge 2^{-Km(x)} \cdot O(1)$ (recall that programs
are binary). Therefore, $m(x) \overset{\times}{\le} \tilde\varrho(x)$ for all $x \in \mathcal{X}^*$. Together with the following
proposition, we thus obtain

$$m(x) \overset{\times}{=} \tilde\varrho(x) \overset{\times}{\le} \tilde\xi(x) \overset{\times}{\le} \varrho(x) \overset{\times}{=} \xi(x) \quad \text{for all } x \in \mathcal{X}^*. \tag{14}$$



Proposition 17. We have $\tilde\varrho(x) \overset{\times}{\le} m(x)$ for all $x \in \mathcal{X}^*$.

Proof. (Sketch only.) It is not hard to show that given a string $x \in \mathcal{X}^*$ and a
recursive measure $\nu$ (which in particular may be the MDL descriptor $\nu^*(x)$) it
is possible to specify a program $p$ of length at most $-\log_2 w_\nu - \log_2 \nu(x) + c$ that
outputs a string starting with $x$, where the constant $c$ is independent of $x$ and $\nu$.
This is done via arithmetic encoding. Alternatively, it is also possible to prove
the proposition indirectly using [9, Th.4.5.4]. This implies that $m(x) \overset{\times}{\ge} w_\nu \nu(x)$
for all $x \in \mathcal{X}^*$ and all recursive measures $\nu$. Then, also $m(x) \overset{\times}{\ge} \max_\nu \{w_\nu \nu(x)\}$
holds. □

On the other hand, we know from [19] that $m \overset{\times}{\ne} \xi$. Therefore, at least one of
the two inequalities in (14) must be proper.

Problem 18. Which of the inequalities $\tilde\varrho \overset{\times}{\le} \tilde\xi$ and $\tilde\xi \overset{\times}{\le} \varrho$ is proper (or are both)?

Equation (14) also has an easy consequence in terms of randomness criteria. 

Proposition 19. A sequence $x_{<\infty} \in \mathcal{X}^\infty$ is Martin-Löf random with respect to
some computable measure $\mu$ iff for any $f \in \{m, \tilde\varrho, \tilde\xi, \varrho, \xi, M\}$ there is a constant
$C > 0$ such that $f(x_{1:n}) \le C \mu(x_{1:n})$ for all $n \in \mathbb{N}$ holds.

Proof. It is a standard result that if $x_{<\infty}$ is random then $M(x_{1:n}) \le C \mu(x_{1:n})$
for some $C$ [20, Th.3]. Then by (14), $f(x_{1:n}) \overset{\times}{\le} \mu(x_{1:n})$ for all $f$. Conversely, if
$f(x_{1:n}) \overset{\times}{\le} \mu(x_{1:n})$ for some $f$, then there is $C$ such that $m(x_{1:n}) \le C \mu(x_{1:n})$.
This implies $\mu$-randomness of $x_{<\infty}$ ([20, Th.2] or [9, p295]). □

Interestingly, these randomness criteria partly depend on the weights. The
criteria for $\tilde\xi$ and $\tilde\varrho$ are not equivalent any more if weights other than the canonical
weights are used, as the following example will show. In contrast, for $\xi$ and
$\varrho$ there is no weight dependency as long as the weights are strictly greater than
zero, since $\xi \in \mathcal{M}$.

Example 20. There are other randomness criteria than Martin-Löf randomness,
e.g. rec-randomness. A rec-random sequence $x_{<\infty}$ (with respect to the uniform
distribution) satisfies $\nu(x_{1:n}) \le c(\nu) 2^{-n}$ for each computable measure $\nu$ and for
all $n$. It is obvious that Martin-Löf random sequences are also rec-random. The
converse does not hold: there are sequences $x_{<\infty}$ that are rec-random but not
Martin-Löf random, as shown e.g. in [21,22].

Let $x_{<\infty}$ be such a sequence, i.e. $\nu(x_{1:n}) \le c(\nu) 2^{-n}$ for all computable measures
$\nu$ and for all $n$, but where $x_{<\infty}$ is not Martin-Löf random. Let $\nu_1, \nu_2, \ldots$ be a
(non-effective) enumeration of all computable measures. Define $w'_i = 2^{-i} c(\nu_i)^{-1}$.
Then

$$\tilde\xi'(x_{1:n}) = \sum_{i=1}^{\infty} w'_i \nu_i(x_{1:n}) \le \sum_{i=1}^{\infty} 2^{-i} c(\nu_i)^{-1} c(\nu_i) 2^{-n} = 2^{-n} \quad \text{for all } n,$$

i.e. $x_{<\infty}$ is $\tilde\xi'$-random. Thus, $x_{<\infty}$ is also $\tilde\varrho'$-random with $\tilde\varrho' = \max_i \{w'_i \nu_i\}$.

8 Conclusions 

We have proven convergence theorems for MDL prediction for arbitrary countable
classes of semimeasures, the only requirement being that the true distribution
$\mu$ is a measure. Our results hold for both static and dynamic MDL and
provide a statement about convergence speed in mean sum. This also yields both
on-sequence and off-sequence assertions. Our results are to our knowledge the
strongest available for the discrete case.

Compared to the bound for Bayes mixture prediction in Theorem
1, the error bounds for MDL are exponentially worse, namely $w_\mu^{-1}$ instead of
$\ln w_\mu^{-1}$. Our bounds are sharp in general, as Example 9 shows. There are even
classes of Bernoulli distributions where the exponential bound is sharp [18].

In the case of continuously parameterized model classes, finite error bounds
do not hold [6,4], but the error grows slowly as $\ln n$. Under additional assumptions
(i.i.d. for instance) and with a reasonable prior, one can prove similar
behavior of MDL and Bayes mixture predictions [5]. In this sense, MDL con- 
verges as fast as a Bayes mixture there. This fast convergence even holds for the 
“slow” Bernoulli example in [18]. However in Example 9, the error grows as n, 
which shows that the Bayes mixture can be superior to MDL in general.

Abstract. We present a general information exponential inequality that
measures the statistical complexity of some deterministic and random- 
ized density estimators. Using this inequality, we are able to improve 
classical results concerning the convergence of two-part code MDL in [1] . 
Moreover, we are able to derive clean finite-sample convergence bounds 
that are not obtainable using previous approaches. 



1 Introduction 

The purpose of this paper is to study a class of complexity minimization based
density estimation methods using a generalization of $\epsilon$-entropy which we call
KL-complexity. Specifically, we derive a simple yet general information theoretical
inequality that can be used to measure the convergence behavior of some
randomized estimation methods. Consequences of this very basic inequality will 
then be explored. In particular, we apply this analysis to the two-part code MDL 
density estimator studied in [1], and refine their results. 

We shall first introduce the basic notation used in the paper. Consider a sample
space $\mathcal{X}$ and a measure $\mu$ on $\mathcal{X}$ (with respect to some $\sigma$-field). In statistical
inference, nature picks a probability measure $Q$ on $\mathcal{X}$ which is unknown.
We assume that $Q$ has a density $q$ with respect to $\mu$. In density estimation,
the statistician considers a set of probability densities $p(\cdot|\theta)$ (with respect to $\mu$
on $\mathcal{X}$) indexed by $\theta \in \Gamma$.^ Throughout this paper, we always denote the true
underlying density by $q$, which may not belong to the model class $\Gamma$. Given $\Gamma$,
the goal of the statistician is to select a density $p(\cdot|\theta) \in \Gamma$ based on the observed
data $X = \{X_1, \ldots, X_n\} \in \mathcal{X}^n$, such that $p(\cdot|\theta)$ is as close to $q$ as possible when
measured by a certain distance function (to be specified later).

In the framework considered in this paper, we assume that there is a prior
distribution $d\pi(\theta)$ on the parameter space $\Gamma$ that is independent of the observed
data. For notational simplicity, we shall call any observation $X$ dependent probability
density $\hat{w}_X(\theta)$ on $\Gamma$ (measurable on $\mathcal{X}^n \times \Gamma$) with respect to $d\pi(\theta)$ a
posterior randomization measure. In particular, a posterior randomization measure
in our sense is not limited to the Bayesian posterior distribution, which has
a specific meaning. We are interested in the density estimation performance of
randomized estimators that draw $\theta$ according to the posterior randomization measure
$\hat{w}_X(\theta)$ obtained from a class of density estimation schemes. We should note
that in this framework, our density estimator is completely characterized by the
associated posterior randomization density $\hat{w}_X(\theta)$.

^ Without causing any confusion, we may also occasionally denote the model family
$\{p(\cdot|\theta) : \theta \in \Gamma\}$ by the same symbol $\Gamma$.

2 Information Complexity Minimization Method 

We introduce an information theoretical complexity measure of randomized es- 
timators represented as posterior randomization densities. 

Definition 1. Consider a probability density $w(\cdot)$ on $\Gamma$ with respect to $\pi$. The
KL-divergence $D_{KL}(w\,d\pi\|d\pi)$ is defined as:

$$D_{KL}(w\,d\pi\|d\pi) = \int_\Gamma w(\theta) \ln w(\theta)\, d\pi(\theta).$$

The definition becomes the differential entropy for measures on the real line when
we choose the uniform prior. If we place the prior uniformly on an $\epsilon$-net of the
parameter space, then the KL-complexity becomes $\epsilon$-entropy. KL-divergence is a
rather standard information theoretical concept. We will later show that it can
be used to measure the complexity of a randomized estimator. We call such a
measure the KL-complexity or KL-entropy of a randomized estimator.
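For a finite parameter set the integral in Definition 1 reduces to a sum, which makes the remark about $\epsilon$-nets easy to check numerically. The sketch below is illustrative only; the function name and the choice of an 8-point net are assumptions made for this example.

```python
import math

def kl_complexity(w, prior):
    """Discrete analogue of D_KL(w dpi || dpi) = sum_j prior[j] * w[j] * ln w[j].

    `w` is a density with respect to `prior`, so sum_j prior[j] * w[j] must be 1.
    The convention 0 * ln 0 = 0 is used.
    """
    total = sum(p * wj for p, wj in zip(prior, w))
    if not math.isclose(total, 1.0):
        raise ValueError("w must integrate to 1 under the prior")
    return sum(p * wj * math.log(wj) for p, wj in zip(prior, w) if wj > 0)

N = 8
uniform_prior = [1.0 / N] * N
# Deterministic estimator: all mass on one net point, so the density is N there.
point_mass = [float(N) if j == 3 else 0.0 for j in range(N)]

print(kl_complexity(point_mass, uniform_prior))  # equals ln N
print(kl_complexity([1.0] * N, uniform_prior))   # the prior itself has complexity 0
```

With a uniform prior on an $N$-point net, a point mass has KL-complexity $\ln N$, the logarithm of the net size, which is the sense in which the quantity reduces to $\epsilon$-entropy.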

For a real-valued function $f(\theta)$ on $\Gamma$, we denote by $E_\pi f(\theta)$ the expectation of
$f(\cdot)$ with respect to $\pi$. Similarly, for a real-valued function $\ell(x)$ on $\mathcal{X}$, we denote
by $E_q \ell(x)$ the expectation of $\ell(\cdot)$ with respect to the true underlying distribution
$q$. We also use $E_X$ to denote the expectation with respect to the observation $X$
($n$ independent samples from $q$).

The MDL method (7) which we will study in Section 5 can be regarded as a
special case of a general class of estimation methods which we refer to as Information
Complexity Minimization. The method produces a posterior randomization
density. Let $S$ be a pre-defined set of densities on $\Gamma$ with respect to the prior $\pi$.
We consider a general information complexity minimization estimator:

$$\hat{w}_X^S = \arg\min_{w \in S} \left[ -E_\pi w(\theta) \sum_{i=1}^n \ln p(X_i|\theta) + \lambda D_{KL}(w\,d\pi\|d\pi) \right]. \tag{1}$$

If we let $S$ be the set of all possible posterior randomization measures, then
the estimator leads to the Bayesian posterior distribution with $\lambda = 1$ (see [11]).
Therefore bounds obtained for (1) can also be applied to Bayesian posterior
distributions. Instead of focusing on the more special MDL method presented
later in (7), we shall develop our analysis for the general formulation in (1).
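The claim that the unrestricted minimizer of (1) with $\lambda = 1$ is the Bayesian posterior can be checked on a small finite example. The sketch below is not from the paper: it assumes a three-point Bernoulli model class, a hypothetical data set, and uses the standard fact that the objective is minimized by the density proportional to $\exp(\frac{1}{\lambda}\sum_i \ln p(X_i|\theta))$.

```python
import math
import random

def objective(w, prior, loglik, lam):
    """-E_pi w(theta) sum_i ln p(X_i|theta) + lam * D_KL(w dpi || dpi), finite Gamma."""
    fit = -sum(p * wj * ll for p, wj, ll in zip(prior, w, loglik))
    kl = sum(p * wj * math.log(wj) for p, wj in zip(prior, w) if wj > 0)
    return fit + lam * kl

def gibbs_density(prior, loglik, lam):
    """Density w.r.t. the prior proportional to exp(loglik / lam)."""
    m = max(loglik)  # subtract the max for numerical stability
    unnorm = [math.exp((ll - m) / lam) for ll in loglik]
    z = sum(p * u for p, u in zip(prior, unnorm))
    return [u / z for u in unnorm]

# Hypothetical setup: Bernoulli models on a three-point parameter set.
thetas = [0.2, 0.5, 0.8]
prior = [0.5, 0.3, 0.2]
data = [1, 1, 0, 1]
loglik = [sum(math.log(t if x == 1 else 1 - t) for x in data) for t in thetas]

w_star = gibbs_density(prior, loglik, lam=1.0)
posterior = [p * wj for p, wj in zip(prior, w_star)]
print(posterior)  # matches prior * likelihood, normalized

# No randomly drawn competing density achieves a smaller objective value.
rng = random.Random(0)
best = objective(w_star, prior, loglik, 1.0)
for _ in range(5):
    u = [rng.random() + 1e-3 for _ in thetas]
    z = sum(p * uj for p, uj in zip(prior, u))
    print(objective([uj / z for uj in u], prior, loglik, 1.0) >= best)
```

Restricting $S$ to point masses instead turns the same objective into the MDL criterion studied in Section 5.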



3 The Basic Information Theoretical Inequality

The key ingredient of our analysis using KL-complexity is a well-known convex 
duality, which has already been used in some recent machine learning papers to 
study sample complexity bounds [5,7].

Proposition 1. Assume that $f(\theta)$ is a measurable real-valued function on $\Gamma$,
and $w(\theta)$ is a density with respect to $\pi$. Then we have
$E_\pi w(\theta) f(\theta) \le D_{KL}(w\,d\pi\|d\pi) + \ln E_\pi \exp(f(\theta))$.

The basis of the paper is the following lemma, where we assume that $\hat{w}_X(\theta)$ is
a posterior randomization measure (density with respect to $\pi$ that depends on
$X$ and measurable on $\mathcal{X}^n \times \Gamma$).

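Lemma 1 (Information Exponential Inequality). Consider any posterior
randomization density $\hat{w}_X(\theta)$. Let $\alpha$ and $\beta$ be two real numbers. The following
inequality holds for all measurable real-valued functions $L_X(\theta)$ on $\mathcal{X}^n \times \Gamma$:

$$E_X \exp\left[ E_\pi \hat{w}_X(\theta) \left( L_X(\theta) - \alpha \ln E_X e^{\beta L_X(\theta)} \right) - D_{KL}(\hat{w}_X\,d\pi\|d\pi) \right] \le E_\pi \frac{E_X e^{L_X(\theta)}}{E_X^\alpha e^{\beta L_X(\theta)}},$$

where $E_X$ is the expectation with respect to the observation $X$.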

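Proof. From Proposition 1, we obtain

$$\hat{L}(X) = E_\pi \hat{w}_X(\theta) \left( L_X(\theta) - \alpha \ln E_X e^{\beta L_X(\theta)} \right) - D_{KL}(\hat{w}_X\,d\pi\|d\pi) \le \ln E_\pi \exp\left( L_X(\theta) - \alpha \ln E_X e^{\beta L_X(\theta)} \right).$$

Now applying Fubini's theorem to interchange the order of integration, we have:

$$E_X e^{\hat{L}(X)} \le E_X E_\pi \frac{e^{L_X(\theta)}}{E_X^\alpha e^{\beta L_X(\theta)}} = E_\pi \frac{E_X e^{L_X(\theta)}}{E_X^\alpha e^{\beta L_X(\theta)}}.$$

□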

Remark 1. The main technical ingredients of the proof are motivated by techniques
in the recent machine learning literature. The general idea for analyzing
randomized estimators using Fubini's theorem and decoupling was already in
[10]. The specific decoupling mechanism using Proposition 1 appeared in [5,7]
for related problems. A simplified form of Lemma 1 was used in [11] to analyze
Bayesian posterior distributions.



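The following bound is a straightforward consequence of Lemma 1. Note
that for density estimation, the loss $\ell_\theta(x)$ has the form $\ell(p(x|\theta))$, where $\ell(\cdot)$ is
a scaled log-loss.

Theorem 1 (Information Posterior Bounds). We use the notation of
Lemma 1. Let $X = \{X_1, \ldots, X_n\}$ be $n$ samples that are independently drawn
from $q$. Consider a measurable function $\ell_\theta(x) : \Gamma \times \mathcal{X} \to \mathbb{R}$. Consider real numbers
$\alpha$ and $\beta$, and define

$$c_n(\alpha, \beta) = \frac{1}{n} \ln E_\pi \left( \frac{E_q e^{-\ell_\theta(x)}}{E_q^\alpha e^{-\beta \ell_\theta(x)}} \right)^n.$$

Then $\forall t$, the following event holds with probability at least $1 - \exp(-t)$:

$$-\alpha E_\pi \hat{w}_X(\theta) \ln E_q e^{-\beta \ell_\theta(x)} \le \frac{E_\pi \hat{w}_X(\theta) \sum_{i=1}^n \ell_\theta(X_i) + D_{KL}(\hat{w}_X\,d\pi\|d\pi) + t}{n} + c_n(\alpha, \beta).$$

Moreover, we have the following expected risk bound:

$$-\alpha E_X E_\pi \hat{w}_X(\theta) \ln E_q e^{-\beta \ell_\theta(x)} \le E_X \frac{E_\pi \hat{w}_X(\theta) \sum_{i=1}^n \ell_\theta(X_i) + D_{KL}(\hat{w}_X\,d\pi\|d\pi)}{n} + c_n(\alpha, \beta).$$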



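Proof. We use the notation of Lemma 1, with $L_X(\theta) = -\sum_{i=1}^n \ell_\theta(X_i)$. If we
define $\hat{L}(X) = E_\pi \hat{w}_X(\theta)(L_X(\theta) - \alpha \ln E_X e^{\beta L_X(\theta)}) - D_{KL}(\hat{w}_X\,d\pi\|d\pi)$, then by
Lemma 1, we have $E_X e^{\hat{L}(X)} \le e^{n c_n(\alpha,\beta)}$. By Markov's inequality this implies
$e^{\epsilon} P(\hat{L}(X) > \epsilon) \le e^{n c_n(\alpha,\beta)}$ for all $\epsilon$. Given any $t$, choosing $\epsilon = n c_n(\alpha,\beta) + t$
gives $P(\hat{L}(X) > \epsilon) \le e^{-t}$. Therefore
with probability at least $1 - e^{-t}$, $\hat{L}(X) \le n c_n(\alpha,\beta) + t$. Rearranging, we
obtain the first inequality of the theorem.

To prove the second inequality, we still start with $E_X e^{\hat{L}(X)} \le e^{n c_n(\alpha,\beta)}$ from
Lemma 1. From Jensen's inequality with the convex function $e^x$, we obtain
$e^{E_X \hat{L}(X)} \le E_X e^{\hat{L}(X)} \le e^{n c_n(\alpha,\beta)}$. That is, $E_X \hat{L}(X) \le n c_n(\alpha,\beta)$. Rearranging, we
obtain the desired bound. □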

Remark 2. The special case of Theorem 1 with $\alpha = \beta = 1$ is very useful since in
this case, the term $c_n(\alpha, \beta)$ vanishes. In fact, in order to obtain the correct rate
of convergence for non-parametric problems, it is sufficient to choose $\alpha = \beta = 1$.
The more complicated case with general $\alpha$ and $\beta$ is only needed for parametric
problems, where we would like to obtain a convergence rate of the order $O(1/n)$.
In such cases, the choice of $\alpha = \beta = 1$ would lead to a rate of $O(\ln n/n)$, which
is suboptimal.



4 Bounds for Information Complexity Minimization 

Consider the Information Complexity Minimization (1). Given the true density
$q$, if we define

$$\hat{R}_\lambda(w) = \frac{1}{n} E_\pi w(\theta) \sum_{i=1}^n \ln \frac{q(X_i)}{p(X_i|\theta)} + \frac{\lambda}{n} D_{KL}(w\,d\pi\|d\pi), \tag{2}$$

then it is clear that

$$\hat{w}_X^S = \arg\min_{w \in S} \hat{R}_\lambda(w).$$

The above estimation procedure finds a randomized estimator by minimizing
the regularized empirical risk $\hat{R}_\lambda(w)$ among all possible densities with respect
to the prior $\pi$ in a pre-defined set $S$.

The purpose of this section is to study the performance of the estimator 
defined in (2) using Theorem 1. For simplicity, we shall only study the expected 
performance using the second inequality, although similar results can be obtained 
using the first inequality (which leads to exponential probability bounds).

One may define the true risk of $w$ by replacing the empirical expectation in
(1) with the true expectation with respect to $q$:

$$R_\lambda(w) = E_\pi w(\theta) D_{KL}(q\|p(\cdot|\theta)) + \frac{\lambda}{n} D_{KL}(w\,d\pi\|d\pi), \tag{3}$$

where $D_{KL}(q\|p) = E_q \ln(q(x)/p(x))$ is the KL-divergence between $q$ and $p$. The
information complexity minimizer in (1) can be regarded as an approximate
solution to (3) using empirical expectation.

Using empirical process techniques, one can typically expect to bound $R_\lambda(w)$
in terms of $\hat{R}_\lambda(w)$. Unfortunately, it does not work in our case since $D_{KL}(q\|p)$
is not well-defined for all $p$. This implies that as long as $w$ has non-zero concentration
around a density $p$ with $D_{KL}(q\|p) = +\infty$, then $R_\lambda(w) = +\infty$. Therefore
we may have $R_\lambda(\hat{w}_X^S) = +\infty$ with non-zero probability even when the sample
size approaches infinity.

A remedy is to use a distance function that is always well-defined. In statistics,
one often considers the $\rho$-divergence for $\rho \in (0, 1)$, which is defined as:

$$D_\rho(q\|p) = \frac{1}{\rho(1-\rho)} E_q \left[ 1 - \left( \frac{p(x)}{q(x)} \right)^\rho \right]. \tag{4}$$

This divergence is always well-defined and $D_{KL}(q\|p) = \lim_{\rho \to 0} D_\rho(q\|p)$. In the
statistical literature, convergence results were often specified under the Hellinger
distance ($\rho = 0.5$). In this paper, we specify convergence results with general $\rho$.
We shall mention that bounds derived in this paper will become trivial when
$\rho \to 0$. This is consistent with the above discussion since $R_\lambda$ (corresponding to
$\rho = 0$) may not converge at all. However, under additional assumptions, such
as the boundedness of $q/p$, $D_{KL}(q\|p)$ exists and can be bounded using the
$\rho$-divergence $D_\rho(q\|p)$.

The following bounds imply that up to a constant, the $\rho$-divergence with
any $\rho \in (0, 1)$ is equivalent to the Hellinger distance. Therefore a convergence
bound in any $\rho$-divergence implies a convergence bound of the same rate in the
Hellinger distance. Since this result is not crucial in our analysis, we skip the
proof due to the space limitation.

Proposition 2. We have the following inequalities $\forall \rho \in [0, 1]$:

$$\max(\rho, 1-\rho) D_\rho(q\|p) \ge \frac{1}{2} D_{1/2}(q\|p) \ge \min(\rho, 1-\rho) D_\rho(q\|p).$$
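The $\rho$-divergence and the two-sided comparison with the $\rho = 1/2$ case can be explored numerically for finite distributions. The following sketch assumes the form $D_\rho(q\|p) = \frac{1}{\rho(1-\rho)} E_q[1 - (p/q)^\rho]$ for (4) and uses two arbitrary three-point distributions chosen for illustration; it is a sanity check on one example, not a proof.

```python
import math

def rho_divergence(q, p, rho):
    """D_rho(q||p) = (1 - E_q (p/q)^rho) / (rho (1 - rho)) for finite distributions."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    moment = sum(qi * (pi / qi) ** rho for qi, pi in zip(q, p) if qi > 0)
    return (1.0 - moment) / (rho * (1.0 - rho))

def kl_divergence(q, p):
    """D_KL(q||p) = E_q ln(q/p); assumes p > 0 wherever q > 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical example distributions.
q = [0.5, 0.3, 0.2]
p = [0.2, 0.2, 0.6]

print(kl_divergence(q, p), rho_divergence(q, p, 1e-6))  # nearly equal as rho -> 0

half = 0.5 * rho_divergence(q, p, 0.5)
for rho in (0.1, 0.25, 0.5, 0.75, 0.9):
    d = rho_divergence(q, p, rho)
    print(rho, max(rho, 1 - rho) * d >= half >= min(rho, 1 - rho) * d)
```

Unlike the KL-divergence, `rho_divergence` stays finite even when `p` assigns zero mass to a point where `q` is positive, which is the reason the paper states its bounds in terms of $D_\rho$.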



4.1 A General Convergence Bound 

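The following general theorem is an immediate consequence of Theorem 1. Most
of our later discussions can be considered as interpretations of this theorem under
various different conditions.

Theorem 2. Consider the estimator $\hat{w}_X^S$ defined in (1). Let $\alpha > 0$. Then $\forall \rho \in (0, 1)$
and $\gamma \ge \rho$ such that $\lambda' = \frac{\lambda\gamma - 1}{\gamma - \rho} \ge 0$, we have:

$$E_X E_\pi \hat{w}_X^S(\theta) D_\rho(q\|p(\cdot|\theta)) \le \frac{-1}{\rho(1-\rho)} E_X E_\pi \hat{w}_X^S(\theta) \ln E_q \left( \frac{p(x|\theta)}{q(x)} \right)^\rho$$

$$\le \frac{\gamma \inf_{w \in S} R_\lambda(w)}{\alpha\rho(1-\rho)} - \frac{\gamma - \rho}{\alpha\rho(1-\rho)} E_X \hat{R}_{\lambda'}(\hat{w}_X^S) + \frac{c_{\rho,n}(\alpha)}{\alpha\rho(1-\rho)},$$

where

$$c_{\rho,n}(\alpha) = \frac{1}{n} \ln E_\pi E_q^{(1-\alpha)n} \left( \frac{p(x|\theta)}{q(x)} \right)^\rho.$$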

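Proof Sketch. Consider an arbitrary data-independent density $w(\theta) \in S$ with
respect to $\pi$. Using (4), we can obtain from Theorem 1 the following chain of
inequalities:

$$\alpha\rho(1-\rho) E_X E_\pi \hat{w}_X^S(\theta) D_\rho(q\|p(\cdot|\theta))$$

$$\le -\alpha E_X E_\pi \hat{w}_X^S(\theta) \ln\left[ 1 - \rho(1-\rho) D_\rho(q\|p(\cdot|\theta)) \right]$$

$$\le E_X \left[ \rho E_\pi \hat{w}_X^S(\theta) \frac{1}{n} \sum_{i=1}^n \ln \frac{q(X_i)}{p(X_i|\theta)} + \frac{D_{KL}(\hat{w}_X^S\,d\pi\|d\pi)}{n} \right] + c_{\rho,n}(\alpha)$$

$$\le \gamma R_\lambda(w) + (\rho - \gamma) E_X \hat{R}_{\lambda'}(\hat{w}_X^S) + c_{\rho,n}(\alpha),$$

where $R_\lambda(w)$ is defined in (3). □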



Remark 3. If $\gamma = \rho$ in Theorem 2, then we also require $\lambda\gamma = 1$, and let $\lambda' = 0$.

Consequences of this theorem will later be applied to MDL methods. Although
the bound in Theorem 2 looks complicated, the most important part on
the right hand side is the first term. The second term is only needed to handle
the situation $\lambda \le 1$. The requirement that $\gamma \ge \rho$ is to ensure that the second
term is non-positive. Therefore in order to apply the theorem, we only need
to estimate a lower bound of $\hat{R}_{\lambda'}(\hat{w}_X^S)$, which (as we shall see later) is much
easier than obtaining an upper bound. The third term is mainly included to
get the correct convergence rate of $O(1/n)$ for parametric problems, and can
be ignored for non-parametric problems. The effect of this term is quite similar
to using localized $\epsilon$-entropy in the empirical process approach for analyzing
the maximum-likelihood method (for example, see [8]). As a comparison, the
KL-entropy in the first term corresponds to the global $\epsilon$-entropy.

Note that one can easily obtain a simplified bound from Theorem 2 by choosing
specific parameters so that both the second term and the third term vanish:




On the Convergence of MDL Density Estimation 



321 



Corollary 1. Consider the estimator $\hat{w}_X^S$ defined in (1). Assume that $\lambda > 1$
and let $\rho = 1/\lambda$. Then we have

$$E_X E_\pi \hat{w}_X^S(\theta) D_\rho(q\|p(\cdot|\theta)) \le \frac{1}{1-\rho} \inf_{w \in S} R_\lambda(w).$$

Proof. We simply let $\alpha = 1$ and $\gamma = \rho$ in Theorem 2. □

An important observation is that for $\lambda > 1$, the convergence rate is solely
determined by the quantity $\inf_{w \in S} R_\lambda(w)$, which we shall refer to as the model
resolvability associated with $S$.

4.2 Some Lower Bounds on $E_X \hat{R}_{\lambda'}(\hat{w}_X^S)$

Lemma 2. VA' > 1; ExRx'{wx) > — ^ > 0. 

Proof. See Appendix A. □ 

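By combining the above estimate with Theorem 2, we obtain the following
refinement of Corollary 1.

Corollary 2. Consider the estimator $\hat{w}_X^S$ defined in (1). Assume that $\lambda > 1$,
then $\forall \rho \in (0, 1/\lambda]$:

$$E_X E_\pi \hat{w}_X^S(\theta) D_\rho(q\|p(\cdot|\theta)) \le \frac{1}{\rho(\lambda - 1)} \inf_{w \in S} R_\lambda(w).$$

Proof. We simply let $\alpha = 1$ and $\gamma = (1 - \rho)/(\lambda - 1)$ in Theorem 2. Note that in
this case, $\lambda' = 1$, and hence by Lemma 2, $E_X \hat{R}_{\lambda'}(\hat{w}_X^S) \ge 0$. □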
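The parameter choice in this proof can be verified with exact rational arithmetic. The sketch below assumes the relation $\lambda' = (\lambda\gamma - 1)/(\gamma - \rho)$ from Theorem 2 (this is the form under which $\gamma \hat{R}_\lambda + (\rho - \gamma)\hat{R}_{\lambda'}$ recombines to $\rho$ times the empirical term plus $D_{KL}/n$) and that the leading coefficient with $\alpha = 1$ is $\gamma/(\rho(1-\rho))$. The function name is hypothetical.

```python
from fractions import Fraction

def corollary2_parameters(lam, rho):
    """Return (gamma, lam_prime, leading_coefficient) for alpha = 1.

    Requires lam > 1 and 0 < rho < 1/lam so that gamma - rho is nonzero.
    """
    gamma = (1 - rho) / (lam - 1)
    lam_prime = (lam * gamma - 1) / (gamma - rho)
    coeff = gamma / (rho * (1 - rho))
    return gamma, lam_prime, coeff

for lam, rho in [(Fraction(2), Fraction(1, 3)), (Fraction(3), Fraction(1, 5))]:
    gamma, lam_prime, coeff = corollary2_parameters(lam, rho)
    print(gamma, lam_prime, coeff, 1 / (rho * (lam - 1)))
```

In each case $\lambda'$ evaluates to exactly 1 and the coefficient equals $1/(\rho(\lambda - 1))$, matching the constant in the corollary.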

Note that Lemma 2 is only applicable for $\lambda' \ge 1$. If $\lambda' < 1$, then we need
a discretization device, which generalizes the upper $\epsilon$-covering number concept
used in [2] for showing the consistency (or inconsistency) of Bayesian posterior
distributions:

Definition 2. The $\epsilon$-upper bracketing number of $\Gamma$, denoted by $N(\Gamma, \epsilon)$, is the
minimum number of non-negative functions $\{f_j\}$ on $\mathcal{X}$ with respect to $\mu$ such
that $E_q(f_j/q) = 1 + \epsilon$, and $\forall \theta \in \Gamma$, $\exists j$ such that $p(x|\theta) \le f_j(x)$ a.e. $[\mu]$.

The discretization device which we shall use in this paper is based on the
following definition:

Definition 3. An $\epsilon$-upper discretization of $\Gamma$ consists of a countable decomposition
of $\Gamma$ as measurable subsets $\{\Gamma_j\}$ such that $\cup_j \Gamma_j = \Gamma$ and
$E_q \sup_{\theta \in \Gamma_j} (p(x|\theta)/q(x)) \le 1 + \epsilon$.



Lemma 3. Consider an $\epsilon$-upper discretization $\{\Gamma_j\}$ of $\Gamma$. The following inequality
is valid $\forall \lambda' \in [0, 1]$:






-I- ln(l -I- e) 



^xR\'{wx) > 



n
Proof. See Appendix B. □

Combining the above estimate with Theorem 2, we obtain the following simplified
bound for $\lambda = 1$. Similar results can be obtained for $\lambda < 1$ but the case
of $\lambda = 1$ is most interesting.

Corollary 3. Consider the estimator $\hat{w}_X^S$ defined in (1). Let $\lambda = 1$. Consider an
$\epsilon$-upper discretization $\{\Gamma_j\}$ of $\Gamma$. $\forall \rho \in (0, 1)$ and $\forall \gamma \ge 1$, we have:



T^x^^wi{e)Dfiq\\p{-\e)) < 



7inft„g5 Aa(w) 7"P 

p(l-p) p(l-p) 



n 



+ ln(l + t) 



Proof. We let $\alpha = 1$ in Theorem 2, and apply Lemma 3. □

Note that the above results immediately imply the following bound using
$\epsilon$-upper entropy by letting $\gamma \to 1$ with a finite $\epsilon$-upper bracketing cover of size
$N(\Gamma, \epsilon)$ as the discretization:



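$$E_X E_\pi \hat{w}_X^S(\theta) D_\rho(q\|p(\cdot|\theta)) \le \frac{\inf_{w \in S} R_\lambda(w)}{\rho(1-\rho)} + \frac{1}{\rho} \inf_{\epsilon > 0} \left[ \frac{\ln N(\Gamma, \epsilon)}{n} + \ln(1 + \epsilon) \right]. \tag{5}$$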



It is clear that Corollary 3 is significantly more general than the covering number 
result (5). We are able to deal with an infinite cover as long as the decay of the 
prior 7 T is fast enough on the discretization so that < -l-oo. 



4.3 Weak Convergence Bound 

The case of $\lambda = 1$ is related to a number of important estimation methods
in statistical applications such as the standard MDL and Bayesian methods.
However, for an arbitrary prior $\pi$ without any additional assumption such as the
fast decay condition in Corollary 3, it is not possible to establish any convergence
rate result in terms of Hellinger distance using the model resolvability quantity
alone, as in the case of $\lambda > 1$ (Corollary 2). See Section 5.4 for an example
demonstrating this claim. However, one can still obtain a weaker convergence
result in this case. The following theorem essentially implies that the posterior
randomization average $E_\pi \hat{w}_X^S(\theta) p(\cdot|\theta)$ converges weakly to $q$ as long as the model
resolvability $\inf_{w \in S} R_\lambda(w) \to 0$ when $n \to \infty$.

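Theorem 3. Consider the estimator $\hat{w}_X^S$ defined in (1) with $\lambda = 1$. Then
$\forall f : \mathcal{X} \to [-1, 1]$, we have:

$$E_X \left| E_\pi \hat{w}_X^S(\theta) E_{p(\cdot|\theta)} f(x) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right| \le 2 A_n + \sqrt{2 A_n},$$

where $A_n = \inf_{w \in S} E_X \hat{R}_\lambda(w) + \frac{\ln 2}{n}$.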







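Proof Sketch. Let $g_\epsilon(x) = 1 - \epsilon f(x)$, and $h_\epsilon(x) = \frac{p(x|\theta)\, g_\epsilon(x)}{q(x)\, E_{p(\cdot|\theta)} g_\epsilon(x)}$, where $\epsilon \in (-1, 1)$
is a parameter to be determined later. Note that $g_\epsilon(x) > 0$. Let $\alpha = \beta = 1$ and
$L_X(\theta) = \sum_{i=1}^n \ln h_\epsilon(X_i)$ in Lemma 1, we have

$$E_X \exp\left[ E_\pi \hat{w}_X^S(\theta) \left( \sum_{i=1}^n \ln h_\epsilon(X_i) - \ln E_X \prod_{i=1}^n h_\epsilon(X_i) \right) - D_{KL}(\hat{w}_X^S\,d\pi\|d\pi) \right] \le 1.$$

If we let

$$\Delta_\epsilon(X) = E_\pi \hat{w}_X^S(\theta) \left[ \sum_{i=1}^n \ln g_\epsilon(X_i) - n \ln E_{p(\cdot|\theta)} g_\epsilon(x) \right],$$

then $E_X e^{\Delta_\epsilon(X) - n \hat{R}_\lambda(\hat{w}_X^S)} \le 1$. This implies that
$E_X [e^{\Delta_\epsilon(X)} + e^{\Delta_{-\epsilon}(X)}] e^{-n \hat{R}_\lambda(\hat{w}_X^S)} \le 2$. Applying Jensen's inequality, we obtain

$$E_X \ln[e^{\Delta_\epsilon(X)} + e^{\Delta_{-\epsilon}(X)}] \le n E_X \hat{R}_\lambda(\hat{w}_X^S) + \ln 2 \le n \inf_{w \in S} E_X \hat{R}_\lambda(w) + \ln 2. \tag{6}$$

Consider $x \le y < 1$. We have the following inequalities (which follow from
Taylor expansion): $x \le -\ln(1 - x) \le x + \frac{x^2}{2(1-y)^2}$. This implies
$\ln g_\epsilon(x) \ge -\epsilon f(x) - \frac{\epsilon^2}{2(1-|\epsilon|)^2}$ and $-\ln E_{p(\cdot|\theta)} g_\epsilon(x) \ge \epsilon E_{p(\cdot|\theta)} f(x)$. Therefore

$$\Delta_\epsilon(X) \ge \epsilon E_\pi \hat{w}_X^S(\theta) \left[ -\sum_{i=1}^n f(X_i) + n E_{p(\cdot|\theta)} f(x) \right] - \frac{n \epsilon^2}{2(1-|\epsilon|)^2}.$$

A similar bound can be obtained for $\Delta_{-\epsilon}(X)$. Now substitute them into (6) and
observe that $|x| \le \ln(e^x + e^{-x})$, we obtain

$$E_X \left| \epsilon E_\pi \hat{w}_X^S(\theta) \left[ -\sum_{i=1}^n f(X_i) + n E_{p(\cdot|\theta)} f(x) \right] \right| - \frac{n \epsilon^2}{2(1-|\epsilon|)^2} \le n \inf_{w \in S} E_X \hat{R}_\lambda(w) + \ln 2.$$

Let $|\epsilon| = \sqrt{2 A_n}/(\sqrt{2 A_n} + 1)$, we obtain the desired bound. □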
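The final substitution can be checked numerically. Dividing the last inequality by $n|\epsilon|$ leaves a bound of the form $A_n/|\epsilon| + |\epsilon|/(2(1-|\epsilon|)^2)$; this intermediate form is our reading of the garbled display and should be treated as an assumption. The sketch below confirms that the stated choice of $|\epsilon|$ turns it into $2A_n + \sqrt{2A_n}$.

```python
import math

def bound_at_choice(a_n):
    """Evaluate A/eps + eps / (2 (1 - eps)^2) at eps = sqrt(2A) / (sqrt(2A) + 1)."""
    s = math.sqrt(2.0 * a_n)
    eps = s / (s + 1.0)
    return a_n / eps + eps / (2.0 * (1.0 - eps) ** 2)

for a_n in (0.01, 0.5, 3.0):
    print(a_n, bound_at_choice(a_n), 2.0 * a_n + math.sqrt(2.0 * a_n))
```

The two printed columns agree, so the choice of $|\epsilon|$ balances the two terms and reproduces the right-hand side of Theorem 3.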



5 MDL on Discrete Net 

The minimum description length (MDL) method has been widely used in practice
[6]. The version we consider here is the same as that of [1]. In fact, results
in this section improve those of [1]. The MDL method considered in [1] can be
regarded as a special case of information complexity minimization. The model
space $\Gamma$ is countable: $\theta \in \Gamma = \{1, 2, \ldots\}$. We denote the corresponding models
$p(x|\theta = j)$ by $p_j(x)$. The prior $\pi$ has the form $\pi = \{\pi_1, \pi_2, \ldots\}$ such that
$\sum_j \pi_j = 1$, where we assume that $\pi_j > 0$ for each $j$. A randomized algorithm can
be represented as a non-negative weight vector $w = [w_j]$ such that $\sum_j \pi_j w_j = 1$.

MDL gives a deterministic estimator, which corresponds to the set of weights
concentrated on any one specific point $k$. That is, we can select $S$ in (1) such
that each weight $w$ in $S$ corresponds to an index $k \in \Gamma$ such that $w_k = 1/\pi_k$
and $w_j = 0$ when $j \ne k$. It is easy to check that $D_{KL}(w\,d\pi\|d\pi) = \ln(1/\pi_k)$. The
corresponding algorithm can thus be described as finding a probability density
$p_{\hat{k}}$ with $\hat{k}$ obtained by

$$\hat{k} = \arg\min_k \left[ \sum_{i=1}^n \ln \frac{1}{p_k(X_i)} + \lambda \ln \frac{1}{\pi_k} \right], \tag{7}$$

where $\lambda \ge 1$ is a regularization parameter. The first term corresponds to the
description of the data, and the second term corresponds to the description of the
model. The choice $\lambda = 1$ can be interpreted as minimizing the total description
length, which corresponds to the standard MDL. The choice $\lambda > 1$ corresponds to
a heavier penalty on the model description, which makes the estimation method
more stable. This modified MDL method was considered in [1], for which the
authors obtained results on the asymptotic rate of convergence. However, no
simple finite sample bounds were obtained. For the case of $\lambda = 1$, only weak
consistency was shown. In the following, we shall improve these results using the
analysis presented in Section 4.
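The selection rule (7) is straightforward to implement for a finite net of models. The following sketch is illustrative only: the Bernoulli net, the data, the priors, and the names `mdl_select` and `bernoulli` are all assumptions made for this example, not part of the paper.

```python
import math

def bernoulli(theta):
    """Return the density x -> p(x | theta) of a Bernoulli(theta) model on {0, 1}."""
    return lambda x: theta if x == 1 else 1.0 - theta

def mdl_select(data, models, prior, lam=1.0):
    """Return the index k minimizing sum_i ln(1/p_k(X_i)) + lam * ln(1/pi_k)."""
    def code_length(k):
        data_part = -sum(math.log(models[k](x)) for x in data)
        model_part = -math.log(prior[k])
        return data_part + lam * model_part
    return min(range(len(models)), key=code_length)

# Hypothetical discrete net theta_j = j/10, j = 1..9 (endpoints omitted to avoid ln 0).
thetas = [j / 10 for j in range(1, 10)]
models = [bernoulli(t) for t in thetas]
data = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]  # seven ones, three zeros

uniform = [1.0 / len(thetas)] * len(thetas)
print(thetas[mdl_select(data, models, uniform)])  # the net point closest to the MLE

# A prior concentrated on theta = 0.5 combined with lam = 2 pulls the choice there.
skewed = [0.01] * len(thetas)
skewed[4] = 0.92
print(thetas[mdl_select(data, models, skewed, lam=2.0)])
```

With a uniform prior the model term is constant and the rule reduces to maximum likelihood on the net; a non-uniform prior with $\lambda > 1$ shows how the heavier model penalty stabilizes the choice.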

5.1 Modified MDL under Global Entropy Condition 

Consider the case $\lambda > 1$ in (7). We can obtain the following theorem from
Corollary 2.

Theorem 4. Consider the estimator $\hat{k}$ defined in (7). Assume that $\lambda > 1$, then
$\forall \rho \in (0, 1/\lambda]$:

$$E_X D_\rho(q\|p_{\hat{k}}) \le \frac{1}{\rho(\lambda - 1)} \inf_k \left[ D_{KL}(q\|p_k) + \frac{\lambda}{n} \ln \frac{1}{\pi_k} \right].$$

Note that in [1], the term $r_{\lambda,n}(q) = \inf_k \left[ D_{KL}(q\|p_k) + \frac{\lambda}{n} \ln \frac{1}{\pi_k} \right]$ is referred
to as the index of resolvability. They showed (Theorem 4) that
$D_{1/2}(q\|p_{\hat{k}}) = O_p(r_{\lambda,n}(q))$ when $\lambda > 1$. Theorem 4 is a slight generalization of a result developed
by Andrew Barron and Jonathan Li, which gave the same inequality
but only for the case of $\lambda = 2$ and $\rho = 1/2$. The result, with a proof quite similar
to what we presented here, can be found in [4] (Theorem 5.5, page 78).

Examples of the index of resolvability for various function classes can be found
in [1]; we shall not repeat them in this paper. In particular, it is known that for
non-parametric problems, with appropriate discretization, the rate matches the
minimax rate such as those in [9].

5.2 Local Entropy Analysis 

Although the bound based on the index of resolvability in Theorem 4 is quite
useful for non-parametric problems (see [1] for examples), it does not handle
the parametric case satisfactorily. To see this, we consider a one-dimensional
parameter family indexed by $\theta \in [0, 1]$, and we discretize the family using a
uniform discrete net of size $N + 1$: $\theta_j = j/N$ ($j = 0, \ldots, N$). If $q$ is taken from
the parametric family so that we can assume that $\inf_k D_{KL}(q\|p_k) = O(N^{-2})$,
then Theorem 4 with $\lambda = 2$, $\rho = 1/2$ and uniform prior on the net becomes
$E_X D_{1/2}(q\|p_{\hat{k}}) \leq O(N^{-2}) + O(\ln N / n)$. Now by choosing $N = O(n^{1/2})$, we obtain a
suboptimal convergence rate $E_X D_{1/2}(q\|p_{\hat{k}}) \leq O(\ln n / n)$. Note that convergence
rates established in [1] for parametric examples are also of the order $O(\ln n / n)$.

The main reason for this sub-optimality is that the complexity measure
$O(\ln N)$ or $O(-\ln \pi_k)$ corresponds to the globally defined entropy. However,
readers who are familiar with empirical process theory know that the rate of
convergence of the maximum likelihood estimate is determined by the local entropy,
which appeared in [3]. For non-parametric problems, it was pointed out in [9]
that the worst case local entropy is of the same order as the global entropy.
Therefore a theoretical analysis which relies on global entropy (such as Theorem 4)
leads to the correct worst case rate, at least in the minimax sense. For parametric
problems, at the $O(1/n)$ approximation level, local entropy is constant but the
global entropy is $\ln n$. This leads to a $\ln(n)$ difference in the resulting bound.

Although it may not be immediately obvious how to define a localized
counterpart of the index of resolvability, we can make a correction term which has the
same effect. As pointed out earlier, this is essentially the role of the $C_{\rho,n}(\alpha)$ term
in Theorem 2. We include a simplified version below, which can be obtained by
choosing $\alpha = 1/2$, and $\gamma = \rho = 1/\lambda$.

Theorem 5. Consider the estimator $\hat{k}$ defined in (7). Assume that $\lambda > 1$, and
let $\rho = 1/\lambda$:



ExA>p(g|bfc) < 



inf 

1 — p k 



DKL{q\\Pk) -I- - In 
n 



A, 



( Pj iC \ 
V 9 ( 2 :) J 



p-l 






The bound relies on a localized version of the index of resolvability, with the 
global entropy — InTr^ replaced by a localized entropy InX^j ~ 

luTTfc. Since 






the localized entropy is always smaller than the global entropy. Intuitively, we 
can see that if Pj{x) is far away from q{x), then ( ^q{x) ) very small 

as n — >■ 00. It follows that the summation in Tr^-Eg mainly 

contributed by terms such that Dp{q\\pj) is small. This is equivalent to a re- 
weighting of prior in such a way that we only count points that are localized 
within a small Dp ball of q. 
This localization leads to the correct rate of convergence for parametric
problems. The effect is similar to using localized entropy in the empirical process
analysis. We consider the maximum likelihood estimate for the general
one-dimensional problem discussed at the beginning of the section, with a uniform
discretization consisting of $N + 1$ points. For one-dimensional parametric
problems, it is natural to assume that the number of $k$ such that $\rho(1-\rho) D_\rho(q\|p_k) \leq
1 - \exp(-m^2/N^2)$ is $O(m)$ for $m \geq 1$. This implies that

$$W_N = \ln \sum_j \left( E_q \left( \frac{p_j(x)}{q(x)} \right)^{\rho} \right)^{n/2} \leq \ln \sum_m O(m) \left( e^{-m^2/N^2} \right)^{n/2} = O(1).$$

Since ttj = l/N, the localized entropy 




is a constant when $N = O(n^{1/2})$. Therefore, with a discretization of this size,
Theorem 5 implies a convergence rate of the correct order $O(1/n)$.

5.3 The Standard MDL (A = 1) 

The standard MDL with $\lambda = 1$ in (7) is more complicated to analyze. It is not
possible to give a bound similar to Theorem 4 that only depends on the index
of resolvability. As a matter of fact, no bound was established in [1]. As we will
show later, the method can converge very slowly even if the index of resolvability
is well-behaved.

However, it is possible to obtain bounds in this case under additional
assumptions on the rate of decay of the prior $\pi$. The following theorem is a
straightforward interpretation of Corollary 3, where we consider the family itself as a
0-upper discretization: $\Gamma_i = \{p_i\}$:

Theorem 6. Consider the estimator defined in (7) with $\lambda = 1$. $\forall \rho \in (0, 1)$ and
$\forall \gamma \geq 1$, we have:



^xDp{q\\pff < 



yinffc 



DKL{q\\pk) + 



P(1 - P) 



1- P 
p(l - p)n 






The above theorem only depends on the index of resolvability and decay of 
the prior tt. If tt has a fast decay in the sense of ^ _|_qq 

does not change with respect to $n$, then the second term on the right hand side
of Theorem 6 is $O(1/n)$. In this case the convergence rate is determined by the
index of resolvability. The prior decay condition specified here is rather mild.
This implies that the standard MDL is usually Hellinger consistent when used
with care.
5.4 Slow Convergence of the Standard MDL

The purpose of this section is to illustrate that the index of resolvability cannot 
by itself determine the rate of convergence for the standard MDL. We consider a 
simple example related to the Bayesian inconsistency counter-example given in 
[2], with an additional randomization argument. Note that due to the random- 
ization, we shall allow two densities in our model class to be identical. It is clear 
from the construction that this requirement is for convenience only, rather than 
anything essential. 

Given a sample size $n$, consider an integer $m$ such that $m \gg n$. Let
the space $\mathcal{X}$ consist of $2m$ points $\{1, \ldots, 2m\}$. Assume that the truth $q$ is the
uniform distribution: $q(u) = 1/(2m)$ for $u = 1, \ldots, 2m$.

Consider a density class $\Gamma'$ consisting of all densities $p$ such that either $p(u) =
0$ or $p(u) = 1/m$. That is, a density $p$ in $\Gamma'$ takes value $1/m$ at $m$ of the $2m$
points, and 0 elsewhere. Now let our model class $\Gamma$ consist of the true
density $q$ with prior $1/4$, and $2^n$ densities $p_j$ ($j = 1, \ldots, 2^n$) that are randomly
(and uniformly) drawn from $\Gamma'$, each with the same prior $3/2^{n+2}$.

We shall show that for a sufficiently large integer $m$, we will estimate one of
the $2^n$ densities from $\Gamma'$ with probability of at least $1 - e^{-0.5}$. Since the index
of resolvability is $\ln 4/n$, which is small when $n$ is
large, the example implies that the convergence of the standard MDL method
cannot be characterized by the index of resolvability alone.

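Let $X = \{X_1, \ldots, X_n\}$ be a set of $n$ samples from $q$ and $\hat{p}$ be the estimator
from (7) with $\lambda = 1$ and $\Gamma$ randomly generated as above. We would like to estimate
$P(\hat{p} = q)$. By construction, $\hat{p} = q$ only when $\prod_{i=1}^{n} p_j(X_i) = 0$ for all $p_j \in \Gamma' \cap \Gamma$.
Now pick $m$ large enough such that $(m - n)^n / m^n \geq 0.5$; we have

$$P(\hat{p} = q) = P\left( \forall p_j \in \Gamma' \cap \Gamma : \prod_{i=1}^{n} p_j(X_i) = 0 \right) = E_X\, P\left( \prod_{i=1}^{n} p_1(X_i) = 0 \,\Big|\, X \right)^{2^n} \leq E_X \left( 1 - \frac{\binom{m}{|X|}}{\binom{2m}{|X|}} \right)^{2^n} \leq e^{-0.5},$$

where $|X|$ denotes the number of distinct elements in $X$. Therefore with a
constant probability, we have $\hat{p} \neq q$ no matter how large $n$ is.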
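The probability in this example can be checked numerically. The sketch below assumes that the $2^n$ densities are drawn independently and uniformly from the class of densities supported on $m$ of the $2m$ points, and uses the fact that any drawn density whose support covers the whole sample beats $q$ in (7) with $\lambda = 1$ (its total description length is smaller by $\ln 3$). The function names and the values of `n` and `m` are illustrative choices, not taken from the paper.

```python
import math

def cover_probability(m, k):
    """Chance that a uniformly random m-point support in {1..2m}
    contains k fixed distinct points."""
    return math.comb(m, k) / math.comb(2 * m, k)

def prob_truth_selected(m, n, distinct):
    """P(p_hat = q) given a sample with `distinct` distinct points:
    none of the 2^n independently drawn densities covers the sample."""
    return (1.0 - cover_probability(m, distinct)) ** (2 ** n)

# Illustrative values with m much larger than n.
n, m = 10, 1000
worst_case = prob_truth_selected(m, n, distinct=n)
```

The case `distinct = n` is the least favourable one for the bad densities, since fewer distinct sample points are easier to cover; even there the chance of recovering $q$ stays below $e^{-0.5}$.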

This example shows that it is not possible to obtain any rate of convergence
result using the index of resolvability alone. In order to estimate convergence, it is
thus necessary to make additional assumptions, such as the prior decay
condition of Theorem 6. We shall also mention that from this example, together with
a construction scheme similar to that of the Bayesian inconsistency
counter-example in [2], it is not difficult to show that the standard MDL is not Hellinger
consistent even when the index of resolvability approaches zero as $n \to \infty$. For
simplicity, we skip the detailed construction in this paper.
5.5 Weak Convergence of the Standard MDL

Although Hellinger consistency cannot be obtained for standard MDL based on
the index of resolvability alone, it was shown in [1] that as $n \to \infty$, if the index of
resolvability approaches zero, then $p_{\hat{k}}$ converges weakly to $q$. Therefore MDL is
effectively weakly consistent as long as $q$ belongs to the information closure of
$\Gamma$. This result is a direct consequence of Theorem 3, which we shall restate here:



Theorem 7. Consider the estimator defined in (7) with $\lambda = 1$. Then $\forall f : \mathcal{X} \to
[-1, 1]$, we have:



Ex 



Ep,/(x) 



1 

n 






< 2A„ + \/2A„, 



where An 



inffc 



DKL{q\\pk) + 



In 2 
n 



Note that this theorem essentially implies that the standard MDL estimator
is weakly consistent as long as the index of resolvability approaches zero when
$n \to \infty$. Moreover, it establishes a rate of convergence result which only depends
on the index of resolvability. This theorem improves the consistency result in [1],
where no rate of convergence results were established, and $f$ was assumed to be
an indicator function.



6 Discussions 

This paper studies certain randomized (and deterministic) density estimation
methods which we call information complexity minimization. We introduced a
general KL-complexity based convergence analysis, and demonstrated that the
new approach can lead to simplified and improved convergence results for
two-part code MDL, which improve the classical results in [1].

An important observation from our study is that generalized information 
complexity minimization methods with regularization parameter A > 1 are more 
robust than the corresponding standard methods with A = 1. That is, their con- 
vergence behavior is completely determined by the local prior density around the 
true distribution measured by the model resolvability inf^jgs For MDL, 

this quantity (index of resolvability) is well-behaved if we put a not too small
prior mass at a density that is close to the truth $q$. We have also demonstrated
through an example that the standard MDL does not have this desirable
property: even if we can guess the true density by putting a relatively large prior
mass at the true density $q$, we may not estimate $q$ very well as long as there exists
a bad (random) prior structure, even at places very far from the truth $q$.
A Proof of Lemma 2

Applying the convex duality in Proposition 1 with f{x) = —^ pl^'\0) ’ 

we obtain 

(«T ) > - ^ In E, exp 1 g In , 

Taking expectation and using Jensen’s inequality with the convex function 

^/>(x) = — ln(x), we obtain 

ExSa-OT) > - ^ In Ex E. exp (-^glngb^ > 0. 

B Proof of Lemma 3 

The proof is similar to that of Lemma 2, but with a slightly different estimate. 

We again start with the inequality 

Sa. («T ) > - ^ In E, exp i g In , 
Taking expectation and using Jensen’s inequality with the convex function
$\psi(x) = -\ln(x)$, we obtain



-'ExRyi'Wx) <-lnExE^'exp ( - 



g(^^) 



^ — In Ex 
n 



^7r(E,)exp --^In 



q{Xi) 



A'-^ supgg^.p(X,|6l) 



^ — In Ex 
n 



= -ln 
n 



< — In 
n 



exp ( -X^ln 

3 

n 

3 *=1 

1 

^7r(T,)^'(l + e) 



q{Xi) 



^ supggp.p(X,|6»)^ 

sup0gr,p(^*l^') 



A' 



The third inequality follows from the fact that $\forall \lambda' \in [0, 1]$ and positive numbers
$\{a_j\}$: $\left( \sum_j a_j \right)^{\lambda'} \leq \sum_j a_j^{\lambda'}$.
Abstract. We show that forms of Bayesian and MDL inference that
are often applied to classification problems can be inconsistent. This 
means there exists a learning problem such that for all amounts of data 
the generalization errors of the MDL classifier and the Bayes classifier 
relative to the Bayesian posterior both remain bounded away from the 
smallest achievable generalization error. 



1 Introduction 

Overfitting is a central concern of machine learning and statistics. Two frequently
used learning methods that in many cases ‘automatically’ protect against
overfitting are Bayesian inference [5] and the Minimum Description Length (MDL)
Principle [21,2,11]. We show that, when applied to classification problems, some
of the standard variations of these two methods can be inconsistent in the sense
that they asymptotically overfit: there exist scenarios where, no matter how much
data is available, the generalization error of a classifier based on MDL or the full
Bayesian posterior does not converge to the minimum achievable generalization
error within the set of classifiers under consideration.

Some Caveats and Warnings. These results must be interpreted carefully. There
exist many different versions of MDL and Bayesian inference, only some of which
are covered. For the case of MDL, we show our result for a two-part form of
MDL that has often been used for classification. For the case of Bayes, our
result may appear to contradict some well-known Bayesian consistency results
[6]. Indeed, our result only applies to a ‘pragmatic’ use of Bayes, where the set
of hypotheses under consideration are classifiers: functions mapping each input
$X$ to a discrete class label $Y$. To apply Bayes rule, these classifiers must be
converted into conditional probability distributions. We do this conversion in a
standard manner, crossing a prior on classifiers with a prior on error rates for
these classifiers. This may lead to (sometimes subtly) ‘misspecified’ probability
models not containing the ‘true’ distribution $D$. Thus, our result may be restated
as ‘Bayesian methods for classification can be inconsistent under misspecification
for common classification probability models’. The result is still interesting, since
(1) even under misspecification, Bayesian inference is known to be consistent
under fairly broad conditions - we provide an explicit context in which it is
not; (2) in practice, Bayesian inference is used frequently for classification under
misspecification - see Section 6.



1.1 A Preview 



Classification Problems. A classification problem is defined on an input (or
feature) domain $\mathcal{X}$ and output domain (or class label) $\mathcal{Y} = \{0, 1\}$. The problem
is defined by a probability distribution $D$ over $\mathcal{X} \times \mathcal{Y}$. A classifier is a function
$c : \mathcal{X} \to \mathcal{Y}$. The error rate of any classifier is quantified as:

$$e_D(c) = E_{(x,y) \sim D}\, I(c(x) \neq y)$$

where $(x, y) \sim D$ denotes a draw from the distribution $D$ and $I(\cdot)$ is the indicator
function which is 1 when its argument is true and 0 otherwise.

The goal is to find a classifier which, as often as possible according to $D$,
correctly predicts the class label given the input feature. Typically, the
classification problem is solved by searching for some classifier $c$ in a limited subset $C$
of all classifiers using a sample $S = (x_1, y_1), \ldots, (x_m, y_m) \sim D^m$ generated by
$m$ independent draws from the distribution $D$. Naturally, this search is guided
by the empirical error rate. This is the error rate on the sample $S$, defined by:

$$\hat{e}_S(c) := E_{(x,y) \sim S}\, I(c(x) \neq y) = \frac{1}{|S|} \sum_{(x,y) \in S} I(c(x) \neq y),$$

where $(x, y) \sim S$ denotes a sample drawn from the uniform distribution on $S$.
Note that $\hat{e}_S(c)$ is a random variable dependent on a draw from $D^m$. In contrast,
$e_D(c)$ is a number (an expectation) relative to $D$.
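As a small illustration of the empirical error rate, the following sketch computes the fraction of mistakes a classifier makes on a labeled sample. The function name and the representation of a sample as a list of `(x, y)` pairs are assumptions made for this example.

```python
def empirical_error(classifier, sample):
    """Fraction of labeled examples (x, y) in `sample` that `classifier` gets wrong."""
    if not sample:
        raise ValueError("sample must be non-empty")
    mistakes = sum(1 for x, y in sample if classifier(x) != y)
    return mistakes / len(sample)

# Illustrative sample: each x is a tuple of binary features, y is the label.
sample = [((0,), 0), ((1,), 1), ((1,), 0), ((0,), 0)]
first_feature = lambda x: x[0]      # a classifier that outputs its first feature
error = empirical_error(first_feature, sample)
```

Here the classifier disagrees with the label on one of the four examples.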



The Basic Result. Our basic result is that certain classifier learning algorithms
may not behave well as a function of the information they use, even when given
infinitely many samples to learn from. The learning algorithms we analyze are
“Bayesian classification” (Bayes), “Maximum a Posteriori classification” (MAP),
and “Minimum Description Length classification” (MDL). These algorithms are
precisely defined later. Functionally they take as arguments a training sample $S$
and a “prior” $P$ which is a probability distribution over a set of classifiers $C$. In
Section 3 we state our basic result, Theorem 2. The theorem has the following
corollary, indicating suboptimal behavior of Bayes and MDL:

Corollary 1. (Classification Inconsistency) There exists an input domain
$\mathcal{X}$, a prior $P$ always nonzero on a countable set of classifiers $C$, a learning
problem $D$, and a constant $K > 0$ such that the Bayesian classifier $c_{\mathrm{BAYES}(P,S)}$, the
MAP classifier $c_{\mathrm{MAP}(P,S)}$, and the MDL classifier $c_{\mathrm{MDL}(P,S)}$ are asymptotically
$K$-suboptimal. That is, for each $e \in \{e_D(c_{\mathrm{BAYES}(P,S)}), e_D(c_{\mathrm{MAP}(P,S)}), e_D(c_{\mathrm{MDL}(P,S)})\}$,
we have

$$\lim_{m \to \infty} \Pr_{S \sim D^m} \left( e > K + \inf_{c \in C} e_D(c) \right) = 1.$$

How dramatic is this result? We may ask (1) are the priors $P$ for which the
result holds natural; (2) how large can the constant $K$ become and how small
can $\inf_{c \in C} e_D(c)$ be; (3) perhaps demanding an algorithm which depends on
the prior $P$ and the sample $S$ to be consistent (asymptotically optimal) is too
strong? The short answer to (1) and (2) is: the priors $P$ have to satisfy several
requirements, but they correspond to priors often used in practice. $K$ can be
quite large and $\inf_{c \in C} e_D(c)$ can be quite small - see Section 5.1 and Figure 1.

The answer to (3) is that there do exist simple algorithms which are
consistent. An example is the algorithm which minimizes the Occam’s Razor bound
(ORB) [7], Section 4.2.

Theorem 1. (ORB consistency) For all priors $P$ nonzero on a set of
classifiers $C$, for all learning problems $D$, and all constants $K > 0$ the ORB classifier
$c_{\mathrm{ORB}(P,S)}$ is asymptotically $K$-optimal:

$$\lim_{m \to \infty} \Pr_{S \sim D^m} \left( e_D(c_{\mathrm{ORB}(P,S)}) > K + \inf_{c \in C} e_D(c) \right) = 0.$$

The remainder of this paper first defines precisely what we mean by the above 
classifiers. It then states the main inconsistency theorem which implies the above 
corollary, as well as a theorem that provides an upper-bound on how badly Bayes 
can behave. In Section 4 we prove our theorems. Variations of the result are 
discussed in Section 5.1. A discussion of the result from a Bayesian point of view 
is given in Section 6. 



2 Some Classification Algorithms 

The basic inconsistency result is about particular classifier learning algorithms 
which we define next. 



The Bayesian Classification Algorithm. The Bayesian approach to
inference starts with a prior probability distribution $P$ over a set of distributions $\mathcal{P}$
which typically represents a measure of “belief” that some $p \in \mathcal{P}$ is the
process generating data. Bayes’ rule states that, given sample data $S$, the posterior
probability $P(\cdot \mid S)$ that some $p$ is the process generating the data is:

$$P(p \mid S) = \frac{p(S) P(p)}{P(S)}$$

where $P(S) := E_{p \sim P}\, p(S)$. In classification problems with sample size $m =
|S|$, each $p \in \mathcal{P}$ is a distribution on $(\mathcal{X} \times \mathcal{Y})^m$ and the outcome $S =
(x_1, y_1), \ldots, (x_m, y_m)$ is the sequence of labeled examples.
If we intend to perform classification based on a set of classifiers $C$ rather
than distributions $\mathcal{P}$, it is natural to introduce a “prior” $P(c)$ that a particular
classifier $c : \mathcal{X} \to \{0, 1\}$ is the best classifier for solving some learning problem.
This, of course, is not a Bayesian prior in the conventional sense because
classifiers do not induce a measure over the training data. Our inconsistency result
applies to the standard method of converting a “prior” over classifiers into a
Bayesian prior over distributions on the observations.

One common conversion [14,22,12] transforms the set of classifiers $C$ into a
simple logistic regression model - the precise relationship to logistic regression
is discussed in Section 5.2. In our case $c(x) \in \{0, 1\}$ is binary valued, and then
(but only then) the conversion amounts to assuming that the error rate $\theta$ of the
optimal classifier is independent of the feature value $x$. This is known as
“homoskedasticity” in statistics and “label noise” in learning theory. More precisely,
it is assumed that, for the optimal classifier $c \in C$, there exists some $\theta$ such that
$\forall x\; P(c(x) \neq y) = \theta$. Given this assumption, we can construct a conditional
probability distribution over the labels given the unlabeled data:

$$p_{c,\theta}(y^m \mid x^m) = \theta^{m \hat{e}_S(c)} (1 - \theta)^{m - m \hat{e}_S(c)}. \qquad (1)$$



For each fixed $\theta < 0.5$, the log likelihood $\log p_{c,\theta}(y^m \mid x^m)$ is linearly decreasing
in the empirical error that $c$ makes on $S$. By differentiating with respect to $\theta$, we
see that for fixed $c$, the likelihood (1) is maximized by setting $\theta := \hat{e}_S(c)$, giving

$$\log \frac{1}{p_{c,\hat{e}_S(c)}(y^m \mid x^m)} = m H(\hat{e}_S(c)), \qquad (2)$$

where $H$ is the binary entropy $H(\mu) = -\mu \log \mu - (1 - \mu) \log(1 - \mu)$, which is
strictly increasing for $\hat{e}_S(c) \in [0, 0.5)$. We further assume that some distribution
$p_x$ on $\mathcal{X}^m$ generates the $x$-values¹. We can apply Bayes rule to get a posterior
on $p_{c,\theta}$, denoted as $P(c, \theta \mid S)$, without knowing $p_x$, since the $p_x(x^m)$-factors
cancel:

$$P(c, \theta \mid S) = \frac{p_{c,\theta}(y^m \mid x^m)\, p_x(x^m)\, P(c, \theta)}{P(y^m \mid x^m)\, p_x(x^m)} = \frac{p_{c,\theta}(y^m \mid x^m)\, P(c, \theta)}{E_{c,\theta \sim P}\, p_{c,\theta}(y^m \mid x^m)}. \qquad (3)$$



To make (3) applicable, we need to incorporate a prior measure on the joint
space $C \times [0, 1]$ of classifiers and $\theta$-parameters. In the next section we discuss the
priors under which our theorems hold.
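The relationship between the label-noise likelihood and the binary entropy can be checked numerically. The sketch below assumes the likelihood of the labels is $\theta^{k}(1-\theta)^{m-k}$ when the classifier makes $k$ mistakes on $m$ examples, and measures codelengths in bits (base-2 logarithms); the function names and the values of `m` and `mistakes` are illustrative.

```python
import math

def binary_entropy(mu):
    """H(mu) in bits, with the convention 0 log 0 = 0."""
    if mu in (0.0, 1.0):
        return 0.0
    return -mu * math.log2(mu) - (1.0 - mu) * math.log2(1.0 - mu)

def label_codelength(theta, m, mistakes):
    """-log2 of theta^mistakes * (1 - theta)^(m - mistakes), for 0 < theta < 1."""
    return -(mistakes * math.log2(theta) + (m - mistakes) * math.log2(1.0 - theta))

# Illustrative values: 5 mistakes on 20 examples, so the empirical error is 0.25.
m, mistakes = 20, 5
e_hat = mistakes / m
best = label_codelength(e_hat, m, mistakes)
```

Setting $\theta$ to the empirical error rate gives the shortest codelength for the labels, and that codelength equals $m$ times the binary entropy of the empirical error.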

Bayes rule (3) is formed into a classifier learning algorithm by choosing the
most likely label given the input $x$ and the posterior $P(\cdot \mid S)$:

$$c_{\mathrm{BAYES}(P,S)}(x) = \begin{cases} 1 & \text{if } E_{c,\theta \sim P(\cdot \mid S)}\, p_{c,\theta}(Y = 1 \mid X = x) \geq \frac{1}{2}, \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$

¹ And, in particular, that this distribution is independent of $c$ and $\theta$.
The MAP Classification Algorithm. The integrations of the full Bayesian
classifier can be too computationally intensive, so we sometimes predict using
the Bayesian Maximum A Posteriori (MAP) classifier. This classifier is given by:

$$c_{\mathrm{MAP}(P,S)} = \arg\max_{c \in C} \max_{\theta \in [0,1]} P(c, \theta \mid S) = \arg\max_{c \in C} \max_{\theta \in [0,1]} p_{c,\theta}(y^m \mid x^m) P(c, \theta)$$

with ties broken arbitrarily. Integration over $\theta \in [0, 1]$ being much less
problematic than summation over $c \in C$, one sometimes uses a learning algorithm which
integrates over $\theta$ (like full Bayes) but maximizes over $c$ (like MAP):

$$c_{\mathrm{SMAP}(P,S)} = \arg\max_{c \in C} P(c \mid S) = \arg\max_{c \in C} E_{\theta \sim P(\theta)}\, p_{c,\theta}(y^m \mid x^m) P(c \mid \theta).$$



The MDL Classification Algorithm. The MDL approach to classification
is transplanted from the MDL approach to density estimation. There is no such
thing as a ‘definition’ of MDL for classification because the transplant has been
performed in various ways by various authors. Nonetheless, most
implementations are essentially equivalent to the following algorithm [20,21,15,12]:

$$c_{\mathrm{MDL}(P,S)} = \arg\min_{c \in C} \left\{ \log \frac{1}{P(c)} + \log \binom{m}{m \hat{e}_S(c)} \right\}. \qquad (5)$$

The quantity minimized has a coding interpretation: it is the number of bits
required to describe the classifier plus the number of bits required to describe
the labels on $S$ given the classifier and the unlabeled data. We call $-\log P(c) +
\log \binom{m}{m \hat{e}_S(c)}$ the two-part MDL codelength for encoding data $S$ with classifier $c$.
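The two-part codelength can be sketched for a finite list of classifiers as follows. The function names, the restriction to finitely many classifiers, and the toy sample are assumptions made for this example; codelengths are measured in bits.

```python
import math

def empirical_mistakes(classifier, sample):
    """Number of examples (x, y) in `sample` that `classifier` labels incorrectly."""
    return sum(1 for x, y in sample if classifier(x) != y)

def two_part_codelength(prior_c, m, mistakes):
    """log2(1/P(c)) + log2 C(m, mistakes): bits for the classifier plus bits
    for which of the m labels it gets wrong."""
    return math.log2(1.0 / prior_c) + math.log2(math.comb(m, mistakes))

def mdl_classifier(classifiers, prior, sample):
    """Index of the classifier with the shortest two-part codelength."""
    m = len(sample)
    lengths = [two_part_codelength(p, m, empirical_mistakes(c, sample))
               for c, p in zip(classifiers, prior)]
    return min(range(len(lengths)), key=lengths.__getitem__)

# Illustrative setup: classifier j outputs feature j of the input.
classifiers = [lambda x: x[0], lambda x: x[1]]
prior = [0.9, 0.1]
labels = [0, 1, 0, 1, 0, 1, 0, 1]
# Feature 0 is wrong on the first two examples; feature 1 always equals the label.
sample = [((y if i >= 2 else 1 - y, y), y) for i, y in enumerate(labels)]
chosen = mdl_classifier(classifiers, prior, sample)
```

In this toy sample the low-prior classifier is selected because its zero empirical error saves more bits than its smaller prior costs, which is the same trade-off that the inconsistency construction exploits.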



3 Main Theorems 



In this section we state the basic inconsistency theorem. We prove inconsistency
for some countable set of classifiers $C = \{c_0, c_1, \ldots\}$ which we define later. The
inconsistency is attained for priors with ‘heavy tails’, satisfying

$$\log \frac{1}{P(c_k)} \leq \log k + o(\log k). \qquad (6)$$

This condition is satisfied by, for example, Rissanen’s universal prior for the
integers [21]. The sensitivity of our result to the choice of prior is analyzed
further in Section 5.1. The prior on $\theta$ can be any distribution on $[0, 1]$ with a
continuously differentiable density $P$ bounded away from 0, i.e. for some $\gamma > 0$,

$$\text{for all } \theta \in [0, 1],\quad P(\theta) > \gamma. \qquad (7)$$

For example, we may take the uniform distribution with $P(\theta) = 1$. We assume
that the prior $P(\theta)$ on $[0, 1]$ and the prior $P(c)$ on $C$ are independent, so that
$P(c, \theta) = P(c) P(\theta)$. In the theorem, $H(\mu) = -\mu \log \mu - (1 - \mu) \log(1 - \mu)$ stands
for the binary entropy of a coin with bias $\mu$.
Theorem 2. (Classification Inconsistency) There exists an input space $\mathcal{X}$
and a countable set of classifiers $C$ such that the following holds: let $P$ be any
prior satisfying (6) and (7). For all $\mu \in (0, 0.5)$ and all $\mu' \in [\mu, H(\mu)/2)$, there
exists a $D$ with $\min_{c \in C} e_D(c) = \mu$ such that, for all large $m$, all $\delta > 0$,

$$\Pr_{S \sim D^m} \left( e_D(c_{\mathrm{MAP}(P,S)}) = \mu' \right) \geq 1 - a_m,$$
$$\Pr_{S \sim D^m} \left( e_D(c_{\mathrm{SMAP}(P,S)}) = \mu' \right) \geq 1 - a_m,$$
$$\Pr_{S \sim D^m} \left( e_D(c_{\mathrm{MDL}(P,S)}) = \mu' \right) \geq 1 - a_m,$$
$$\Pr_{S \sim D^m} \left( e_D(c_{\mathrm{BAYES}(P,S)}) \geq \mu' - \delta \right) \geq 1 - a_m,$$

where $a_m = 3 \exp(-2\sqrt{m})$.

The theorem states that Bayes is inconsistent for all large $m$ on a fixed
distribution $D$. This is a significantly more difficult statement than “for all (large) $m$,
there exists a learning problem where Bayes is inconsistent”². Differentiation of
$0.5H(\mu) - \mu$ shows that the maximum discrepancy between $e_D(c_{\mathrm{MAP}(P,S)})$ and $\mu$
is achieved for $\mu = 1/5$. With this choice of $\mu$, $0.5H(\mu) - \mu = 0.1609\ldots$ so that,
by choosing $\mu'$ arbitrarily close to $0.5H(\mu)$, the discrepancy $\mu' - \mu$ comes arbitrarily
close to $0.1609\ldots$. These findings are summarized in Figure 1.
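The value $0.1609\ldots$ and its location can be reproduced with a short numerical check, assuming the entropy is measured in bits (base-2 logarithm), which is the convention under which the maximizer is $1/5$. The function names and the grid resolution are illustrative.

```python
import math

def binary_entropy(mu):
    """H(mu) in bits for 0 < mu < 1."""
    return -mu * math.log2(mu) - (1.0 - mu) * math.log2(1.0 - mu)

def discrepancy(mu):
    """Largest gap mu' - mu allowed by Theorem 2, i.e. 0.5 * H(mu) - mu."""
    return 0.5 * binary_entropy(mu) - mu

# Grid search over (0, 0.5) to locate the maximizer numerically.
grid = [i / 1000 for i in range(1, 500)]
best_mu = max(grid, key=discrepancy)
best_gap = discrepancy(best_mu)
```

Setting the derivative $0.5 \log_2((1-\mu)/\mu) - 1$ to zero gives $(1-\mu)/\mu = 4$, which agrees with the grid search.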

How large can the discrepancy between $\mu = \inf_c e_D(c)$ and $\mu' =
e_D(c_{\mathrm{BAYES}(P,S)})$ be in the large $m$ limit, for general learning problems? Our
next theorem, again summarized in Figure 1, gives an upper bound, namely,
$\mu' < H(\mu)$:

Theorem 3. (Maximal Inconsistency of Bayes) Let $S^i$ be the sequence
consisting of the first $i$ examples $(x_1, y_1), \ldots, (x_i, y_i)$. For all priors $P$ nonzero
on a set of classifiers $C$, for all learning problems $D$ with $\inf_{c \in C} e_D(c) = \mu$, for
all $\delta > 0$, for all large $m$, with $D^m$-probability $\geq 1 - \exp(-2\sqrt{m})$,

$$\frac{1}{m} \sum_{i=1}^{m} \left| y_i - c_{\mathrm{BAYES}(P,S^{i-1})}(x_i) \right| \leq H(\mu) + \delta.$$

The theorem says that for large $m$, the total number of mistakes when
successively classifying $y_i$ given $x_i$ made by the Bayesian algorithm based on
$S^{i-1}$, divided by $m$, is not larger than $H(\mu)$. By the law of large numbers,
it follows that for large $m$, $e_D(c_{\mathrm{BAYES}(P,S^{i-1})}(x_i))$, averaged over all $i$, is no
larger than $H(\mu)$. Thus, it is not ruled out that sporadically, for some $i$,
$e_D(c_{\mathrm{BAYES}(P,S^{i-1})}(x_i)) > H(\mu)$; but this must be ‘compensated’ for by most other
$i$. We did not find a proof that $e_D(c_{\mathrm{BAYES}(P,S^{i-1})}(x_i)) < H(\mu)$ for all large $i$.

4 Proofs 

In this section we present the proofs of our three theorems. Theorems 2 and 3
both make use of the following lemma:

² In fact, a meta-argument can be made that any nontrivial learning algorithm is
‘inconsistent’ in this sense for finite $m$.
Fig. 1. A graph depicting the set of asymptotically allowed error rates for different
classification algorithms. The $x$-axis depicts the optimal classifier’s error rate $\mu$ (also
shown as the straight line). The lower curve is just $0.5H(\mu)$ and the upper curve is
$H(\mu)$. Theorem 2 says that any $(\mu, \mu')$ between the straight line and the lower curve
can be achieved for some learning problem $D$ and prior $P$. Theorem 3 shows that the
Bayesian learner can never have asymptotic error rate $\mu'$ above the upper curve.



Lemma 1. There exists 7 > 0 such that for all classifiers c, a > 0, m > 0, all 
S ~ D™ satisfying a + Ijy/m < es(c) < 0.5, all priors satisfying (7): 



log- 



P{y 



m rpm 



1 

es(c)) 
log 



< log 



P{y 



m rpTn 



< 



1 



P{y 



m rpTYl 



1 11 

+ -logm+ r-log 7 . ( 8 ) 

x"\c,es\c)) 2 2 q;( 1 — Of) 



Proof, (sketch) For the first inequality, note 



I a;"*, c) / P{y’^ I a;”*, c, e)P{9)d9 ~ P{y^ | c, es(c)) ’ 

since the likelihood P{y'^ \ x'^,c,9) is maximized at 9 = es{c). For the second 
inequality, note that 




/•es(c) + l/\An 
^,c,9)P{9)d6 > / exp(logP(i/ 

J ea(c) — l/Orn 



es{c)--l/y/m 



x^,c,9)+\ogP(9))d9. 
We obtain (8) by expanding $\log P(y^m \mid x^m, c, \theta)$ around the maximum $\theta = \hat{e}_S(c)$
using a second-order Taylor approximation. See [2] for further details.



4.1 Inconsistent Learning Algorithms: Proof of Theorem 2 

Below we first define the particular learning problem that causes inconsistency. 
We then analyze the performance of the algorithms on this learning problem. 



The Learning Problem. For given $\mu$ and $\mu' > \mu$, we construct a learning
problem and a set of classifiers $C = \{c_0, c_1, \ldots\}$ such that $c_0$ is the ‘good’ classifier
with $e_D(c_0) = \mu$, and $c_1, c_2, \ldots$ are all ‘bad’ classifiers with $e_D(c_j) = \mu' > \mu$. $\mathcal{X}$
consists of one binary feature per classifier³, and the classifiers simply output
the value of their special feature. The underlying distribution $D$ is constructed
in terms of $\mu$ and $\mu'$ and a proof parameter $\mu_{\mathrm{hard}} \geq \frac{1}{2}$ (the error rate for “hard”
examples). To construct an example $(x, y)$, we first flip a fair coin to determine
$y$, so $y = 1$ with probability $1/2$. We then flip a coin with bias $p_{\mathrm{hard}} := \mu' / \mu_{\mathrm{hard}}$
which determines if this is a “hard” example or an “easy” example. Based upon
these two coin flips, each $x_j$ is independently generated based on the following
3 cases.

1. For a “hard” example, and for each classifier $c_j$ with $j \geq 1$, set $x_j = |1 - y|$
with probability $\mu_{\mathrm{hard}}$ and $x_j = y$ otherwise.

2. For an “easy” example, and every $j \geq 1$ set $x_j = y$.

3. For the “good” classifier $c_0$ (with true error rate $\mu$), set $x_0 = |1 - y|$ with
probability $\mu$ and $x_0 = y$ otherwise.

The error rates of each classifier are $e_D(c_0) = \mu$ and $e_D(c_j) = \mu'$ for all $j \geq 1$.
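The construction can be sketched as a sampler. Two assumptions are made here and should be read as such: the bias of the hard/easy coin is taken to be $\mu'/\mu_{\mathrm{hard}}$, which is the value that makes the bad classifiers' error rate equal to $\mu'$, and the countably infinite feature set is truncated to `num_bad` bad features. The function name and parameter values are illustrative.

```python
import random

def draw_example(mu, mu_prime, mu_hard, num_bad, rng):
    """Draw one (x, y): x[0] belongs to the good classifier, x[1:] to bad ones."""
    y = rng.randint(0, 1)                            # fair coin for the label
    hard = rng.random() < mu_prime / mu_hard         # assumed bias of the hard/easy coin
    x = [1 - y if rng.random() < mu else y]          # good feature errs with prob. mu
    for _ in range(num_bad):
        if hard and rng.random() < mu_hard:          # bad features err only on hard examples
            x.append(1 - y)
        else:
            x.append(y)
    return x, y

# Illustrative parameters: mu = 0.2, mu' = 0.3, mu_hard = 0.5, three bad features.
rng = random.Random(0)
n = 20000
draws = [draw_example(0.2, 0.3, 0.5, 3, rng) for _ in range(n)]
error_rates = [sum(1 for x, y in draws if x[j] != y) / n for j in range(4)]
```

On easy examples all bad features agree with the label, so their errors are concentrated on the hard examples; this is what lets some bad classifier reach zero empirical error on a sample even though its true error rate is larger than that of the good classifier.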



Bayes and MDL are inconsistent. We now prove Theorem 2. In Stage 1
we show that there exists a $k_m$ such that for every value of $m$, with probability
converging to 1, there exists some ‘bad’ classifier $c_j$ with $0 < j \leq k_m$ that
has 0 empirical error. In Stage 2 we show that the prior of this classifier is
large enough so that its posterior is exponentially larger than that of the good
classifier $c_0$, showing the convergence $e_D(c_{\mathrm{MAP}(P,S)}) \to \mu'$. In Stage 3 we sketch
the convergences $e_D(c_{\mathrm{SMAP}(P,S)}) \to \mu'$, $e_D(c_{\mathrm{MDL}(P,S)}) \to \mu'$, $e_D(c_{\mathrm{BAYES}(P,S)}) \to \mu'$.



Stage 1. Let $m_{\mathrm{hard}}$ denote the number of hard examples generated within a
sample $S$ of size $m$. Let $k$ be a positive integer and $C_k = \{c_j \in C : 1 \leq j \leq k\}$.

® This input space has a countably infinite size. The Bayesian posterior is still com- 
putable for any finite m if we order the features according to the prior of the as- 
sociated classifier. We need only consider features which have an associated prior 
greater than ^ since the minus log-likelihood of the data is always less than m 
bits. Alternatively, we could use stochastic classifiers and a very small input space. 
For all $\epsilon > 0$ and $m > 0$, we have:
gP^^(Vc G Ck : es(c) > 0) 

13 (v! ^ n ^hard i ^ 13 ''Tl-hard \ 

+ Pr (Vc G Cfc : es(c) > 0 | + g') Pr + e'j 

s~D™ V m / s~D<^ \ m J 

^ g- 2 me^ Pr (VcGCfc: es(c) > 0 | + g^j 

s~D"' \ m / 

< + (!_(!_ /ihard)™^^‘'“‘*+'^^)'' < (9) 

Here (a) follows because P{a) = X)h P(a|6)P(5). (b) follows by Va, P : P{a) < 1 
and the Chernoff bound, (c) holds since (1 — (1 — /Thard)’”*-^'*"''^*^^)* is monotonic 
in e, and (d) by Vx G [0, 1], A: > 0 : (1 — x)^ < e“^^. We now set Cm ■= 

and k(m) = — z — r- Then (9) becomes 

^ ' (l-Mhard)"*(*’hard+-m) V J 

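$$\Pr_{S \sim D^m} \left( \forall c \in C_{k(m)} : \hat{e}_S(c) > 0 \right) \leq 2 e^{-2\sqrt{m}}. \qquad (10)$$

On the other hand, by the Chernoff bound we have $\Pr_{S \sim D^m}(\hat{e}_S(c_0) < e_D(c_0) -
\epsilon_m) \leq e^{-2\sqrt{m}}$ for the optimal classifier $c_0$. Combining this with (10) using the
union bound, we get that, with $D^m$-probability at least $1 - 3e^{-2\sqrt{m}}$, the
following event holds:

$$\exists c \in C_{k(m)} : \hat{e}_S(c) = 0 \ \text{ and } \ \hat{e}_S(c_0) \geq e_D(c_0) - \epsilon_m. \qquad (11)$$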

Stage 2. In the following derivation, we assume that the large probability event
(11) holds. We show that this implies that for large $m$, the posterior on some
$c^* \in C_{k(m)}$ with $\hat{e}_S(c^*) = 0$ is greater than the posterior on $c_0$, which implies
that the MAP algorithm is inconsistent. Taking the log of the posterior ratios,
we get:

$$\log \frac{\max_\theta P(c_0, \theta \mid x^m, y^m)}{\max_\theta P(c^*, \theta \mid x^m, y^m)} = \log \frac{\max_\theta P(c_0) P(\theta) P(y^m \mid x^m, c_0, \theta)}{\max_\theta P(c^*) P(\theta) P(y^m \mid x^m, c^*, \theta)}$$
$$= \log \max_\theta P(c_0) P(\theta) P(y^m \mid x^m, c_0, \theta) - \log \max_\theta P(c^*) P(\theta) P(y^m \mid x^m, c^*, \theta). \qquad (12)$$

Using (2) we see that the leftmost term is no larger than

$$
\log \Big(\max_\theta P(c_0)P(\theta)\Big) \cdot \Big(\max_\theta P(y^m \mid x^m, c_0, \theta)\Big) = -mH(e_S(c_0)) + O(1)
\le -mH(e_D(c_0)) - Km\epsilon_m + O(1) = -mH(\mu) - m^{0.75}K + O(1) \qquad (13)
$$

where K is some constant. The last line follows because $H(\mu)$ is continuously differentiable in a small enough neighborhood around $\mu$.

For the rightmost term in (12), by the condition on prior $p(\theta)$, (7),

$$-\log \max_\theta P(c^*)P(\theta)P(y^m \mid x^m, c^*, \theta) \le -\log P(c^*) + \log \gamma. \qquad (14)$$
Using condition (6) on prior $P(c^*)$ and using $c^* \in C_{k(m)}$, we find:

$$\log \frac{1}{P(c^*)} \le \log k(m) + o(\log k(m)), \qquad (15)$$

where $\log k(m) = \log 2\sqrt{m} - (mp_{\mathrm{hard}} + m\epsilon_m)\log(1 - \mu_{\mathrm{hard}})$. Choosing $\mu_{\mathrm{hard}} = 1/2$, this becomes $\log k(m) = \frac{1}{2}\log m + 2m\mu' + m^{0.75} + O(1)$. Combining this with (15), we find that

$$\log \frac{1}{P(c^*)} \le 2m\mu' + o(m) \qquad (16)$$

which implies that (14) is no larger than $2m\mu' + o(m)$. Since $\mu' < H(\mu)/2$, the difference between the leftmost term (13) and the rightmost term (14) in (12) is less than 0 for large m, implying that then $e_D(c_{\mathrm{MAP}(P,S)}) = \mu'$. We derived all this from (11), which holds with probability $\ge 1 - 3\exp(-2\sqrt{m})$. Thus, for all large m, $\Pr(e_D(c_{\mathrm{MAP}(P,S)}) = \mu') \ge 1 - 3\exp(-2\sqrt{m})$, and the result follows.



Stage 3. (sketch) The proof that the integrated MAP classifier $c_{\mathrm{SMAP}(P,S)}$ is inconsistent is similar to the proof for $c_{\mathrm{MAP}(P,S)}$ that we just gave, except that (12) now becomes

$$\log P(c_0)P(y^m \mid x^m, c_0) - \log P(c^*)P(y^m \mid x^m, c^*). \qquad (17)$$

By Lemma 1 we see that, if (11) holds, the difference between (12) and (17) is of order $O(\log m)$. The proof then proceeds exactly as for the MAP case.

To prove inconsistency of $c_{\mathrm{MDL}(P,S)}$, note that the MDL code length of $y^m$ given $x^m$ according to $c_0$ is given by $\log \binom{m}{m e_S(c_0)}$. If (11) holds, then a simple Stirling's approximation as in [12] or [15] shows that $\log \binom{m}{m e_S(c_0)} = mH(e_S(c_0)) - O(\log m)$. Thus, the difference between two-part codelengths achieved by $c_0$ and $c^*$ is given by

$$-mH(e_S(c_0)) + O(\log m) - \log P(c^*). \qquad (18)$$

The proof then proceeds as for the MAP case, with (12) replaced by (18) and a few immediate adjustments.

To prove inconsistency of $c_{\mathrm{BAYES}(P,S)}$, we take $\mu_{\mathrm{hard}}$ not equal to 1/2 but to $1/2 + \delta$ for some small $\delta > 0$. By taking $\delta$ small enough, the proof for $c_{\mathrm{MAP}(P,S)}$ above goes through unchanged so that, with probability $\ge 1 - 3\exp(-2\sqrt{m})$, the Bayesian posterior puts all its weight, except for an exponentially small part, on a mixture of distributions $p_{c,\theta}$ whose Bayes classifier has error rate $\mu'$ and error rate on hard examples $> 1/2$. It can be shown that this implies that for large m, the classification error of $c_{\mathrm{BAYES}(P,S)}$ converges to $\mu'$; we omit details.



4.2 A Consistent Algorithm: Proof of Theorem 1 

In order to prove the theorem, we first state the Occam's Razor Bound classification algorithm, based on minimizing the bound given by the following theorem.

Theorem 4. (Occam's Razor Bound) [7] For all priors P on a countable set of classifiers C, for all distributions D, with probability $1 - \delta$:

$$\forall c : e_D(c) \le e_S(c) + \sqrt{\frac{-\ln P(c) + \ln(1/\delta)}{2m}}.$$

We state the algorithm here in a suboptimal form, which is good enough for our purposes (see [18] for more sophisticated versions):

$$c_{\mathrm{ORB}(P,S)} := \arg\min_{c \in C} \; e_S(c) + \sqrt{\frac{-\ln P(c) + \ln m}{2m}}.$$
Proof of Theorem 1. Set $\delta_m := 1/m$. It is easy to see that

$$\min_{c \in C} \; e_D(c) + \sqrt{\frac{-\ln P(c) + \ln m}{2m}}$$

is achieved for at least one $c \in C = \{c_0, c_1, \ldots\}$. Among all $c_j \in C$ achieving the minimum, let $c_m$ be the one with smallest index j. By the Chernoff bound, we have with probability at least $1 - \delta_m = 1 - 1/m$,

$$e_D(c_m) \ge e_S(c_m) - \sqrt{\frac{\ln(1/\delta_m)}{2m}} = e_S(c_m) - \sqrt{\frac{\ln m}{2m}}, \qquad (19)$$

whereas by Theorem 4, with probability at least $1 - \delta_m = 1 - 1/m$,

$$e_D(c_{\mathrm{ORB}(P,S)}) \le \min_{c \in C} \; e_S(c) + \sqrt{\frac{-\ln P(c) + \ln m}{2m}} \le e_S(c_m) + \sqrt{\frac{-\ln P(c_m) + \ln m}{2m}}.$$

Combining this with (19) using the union bound, we find that

$$e_D(c_{\mathrm{ORB}(P,S)}) \le e_D(c_m) + \sqrt{\frac{-\ln P(c_m) + \ln m}{2m}} + \sqrt{\frac{\ln m}{2m}}$$

with probability at least $1 - 2/m$. The theorem follows upon noting that the right-hand side of this expression converges to $\inf_{c \in C} e_D(c)$ with increasing m.
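As a rough illustration of the Occam's Razor Bound algorithm with $\delta_m = 1/m$, the following sketch minimizes the empirical error plus the square-root penalty over a finite prefix of the classifier list. The function names (`empirical_error`, `orb_objective`, `c_orb`) and the restriction to a finite list are assumptions made for this example; they are not part of the paper.

```python
import math

def empirical_error(classifier, sample):
    """Fraction of labelled examples (x, y) that the classifier gets wrong."""
    return sum(1 for x, y in sample if classifier(x) != y) / len(sample)

def orb_objective(classifier, prior, sample):
    """Empirical error plus sqrt((-ln P(c) + ln m) / (2m)), i.e. the bound with delta = 1/m."""
    m = len(sample)
    penalty = math.sqrt((-math.log(prior) + math.log(m)) / (2 * m))
    return empirical_error(classifier, sample) + penalty

def c_orb(classifiers, priors, sample):
    """Index of the classifier minimizing the bound (finite list only; an assumption of this sketch)."""
    return min(range(len(classifiers)),
               key=lambda j: orb_objective(classifiers[j], priors[j], sample))

# Hypothetical toy data: the label equals the single binary feature.
sample = [((0,), 0), ((1,), 1), ((0,), 0), ((1,), 1)]
classifiers = [lambda x: x[0], lambda x: 0]
priors = [0.5, 0.25]
best = c_orb(classifiers, priors, sample)
```

Because the penalty grows as the prior mass shrinks, a low-prior classifier is selected only if its empirical error is sufficiently smaller.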

4.3 Proof of Theorem 3 

Without loss of generality assume that $c_0$ achieves $\min_{c \in C} e_D(c)$. Consider both the 0/1-loss and the log loss of sequentially predicting with the Bayes predictive distribution $P(Y_i = \cdot \mid X_i = \cdot, S^{i-1})$ given by $P(y_i \mid x_i, S^{i-1}) = E_{c,\theta \sim P(\cdot \mid S^{i-1})}\, p_{c,\theta}(y_i \mid x_i)$. Every time $i \in \{1, \ldots, m\}$ that the Bayes classifier based on $S^{i-1}$ classifies $y_i$ incorrectly, $P(y_i \mid x_i, S^{i-1})$ must be $\le 1/2$ so that $-\log P(y_i \mid x_i, S^{i-1}) \ge 1$. Therefore,

$$\sum_{i=1}^{m} -\log P(y_i \mid x_i, S^{i-1}) \ge \sum_{i=1}^{m} \big|y_i - c_{\mathrm{BAYES}(P,S^{i-1})}(x_i)\big|. \qquad (20)$$
On the other hand we have

$$
\sum_{i=1}^{m} -\log P(y_i \mid x_i, S^{i-1}) = -\log \prod_{i=1}^{m} P(y_i \mid x_i, S^{i-1}) = -\log \prod_{i=1}^{m} \frac{P(y^i \mid x^i)}{P(y^{i-1} \mid x^{i-1})} = -\log P(y^m \mid x^m)
$$
$$
= -\log \sum_{j=0,1,2,\ldots} P(y^m \mid x^m, c_j)P(c_j) \le -\log P(y^m \mid x^m, c_0) - \log P(c_0), \qquad (21)
$$

where the inequality follows because a sum is larger than each of its terms. By the Chernoff bound, for all small enough $\epsilon > 0$, with probability larger than $1 - 2\exp(-2m\epsilon^2)$, we have $|e_S(c_0) - e_D(c_0)| < \epsilon$. We now set $\epsilon_m = m^{-0.25}$. Then, using Lemma 1, with probability larger than $1 - 2\exp(-2\sqrt{m})$, for all large m (21) is less than or equal to

$$
-\log P(y^m \mid x^m, c_0, e_S(c_0)) + \tfrac{1}{2}\log m + C_m \overset{(a)}{=} mH(e_S(c_0)) + \tfrac{1}{2}\log m + C_m \overset{(b)}{\le} mH(e_D(c_0)) + Km^{0.75} + \tfrac{1}{2}\log m + C_m, \qquad (22)
$$

where Cm = {&d{co) — Cm — m“°-^)“^(l — ei)(co) + and K \s & 

constant not depending on $S = S^m$. Here (a) follows from Equation 2 and (b) follows because $H(\mu)$ is continuously differentiable in a neighborhood of $\mu$.

Combining (22) with (20) and using $C_m = O(1)$ we find that with probability $\ge 1 - \exp(-2\sqrt{m})$, $\sum_{i=1}^{m} |y_i - c_{\mathrm{BAYES}(P,S^{i-1})}(x_i)| \le mH(e_D(c_0)) + o(m)$, QED.

5 Technical Discussion 

5.1 Variations of Theorem 2 and Dependency on the Prior 

Prior on classifiers. The requirement (6) that $-\log P(c_k) \le \log k + o(\log k)$ is needed to obtain (16), which is the key inequality in the proof of Theorem 2. If $P(c_k)$ decreases at polynomial rate, but at a degree d larger than one, i.e. if

$$-\log P(c_k) = d\log k + o(\log k), \qquad (23)$$

then a variation of Theorem 2 still applies but the maximum possible discrepancies between $\mu$ and $\mu'$ become much smaller: essentially, if we require $\mu < \mu' < \frac{1}{2d}H(\mu)$ rather than $\mu < \mu' < \frac{1}{2}H(\mu)$ as in Theorem 2, then the argument works for all priors satisfying (23). Since the derivative $dH(\mu)/d\mu \to \infty$ as $\mu \downarrow 0$, by setting $\mu$ close enough to 0 it is possible to obtain inconsistency for any fixed polynomial degree of decrease d. However, the higher d, the smaller $\mu = \inf_{c \in C} e_D(c)$ must be to get any inconsistency with our argument.

Prior on error rates. Condition (7) on the prior on the error rates is satisfied for most reasonable priors. Some approaches to applying MDL to classification problems amount to assuming priors of the form $p(\theta^*) = 1$ for a single $\theta^* \in [0, 1]$. In that case, we can still prove a version of Theorem 2, but the maximum discrepancy between $\mu$ and $\mu'$ may now be either larger or smaller than $H(\mu)/2 - \mu$, depending on the choice of $\theta^*$.
5.2 Properties of the Transformation from Classifiers to Distributions

Optimality and Reliability. Assume that the conditional distribution of y given x according to the 'true' underlying distribution D is defined for all $x \in X$, and let $p_D(y \mid x)$ denote its mass function. Define $\Delta(p_{c,\theta})$ as the Kullback-Leibler (KL) divergence [9] between $p_{c,\theta}$ and the 'true' conditional distribution $p_D$:

$$\Delta(p_{c,\theta}) := KL(p_D \,\|\, p_{c,\theta}) = E_{(x,y) \sim D}\big[-\log p_{c,\theta}(y \mid x) + \log p_D(y \mid x)\big].$$

Proposition 1. Let C be any set of classifiers, and let $c^* \in C$ achieve $\min_{c \in C} e_D(c) = e_D(c^*)$.

1. If $e_D(c^*) < 1/2$, then $\min_{c,\theta} \Delta(p_{c,\theta})$ is uniquely achieved for $(c, \theta) = (c^*, e_D(c^*))$.

2. $\min_{c,\theta} \Delta(p_{c,\theta}) = 0$ iff $p_{c^*, e_D(c^*)}$ is 'true', i.e. if $\forall x, y : p_{c^*, e_D(c^*)}(y \mid x) = p_D(y \mid x)$.

Property 1 follows since for each fixed c, $\min_{\theta \in [0,1]} \Delta(p_{c,\theta})$ is uniquely achieved for $\theta = e_D(c)$ (this follows by differentiation) and satisfies $\min_\theta \Delta(p_{c,\theta}) = \Delta(p_{c, e_D(c)}) = H(e_D(c)) - K_D$, where $K_D = E[-\log p_D(y \mid x)]$ does not depend on c or $\theta$, and $H(\mu)$ is monotonically increasing for $\mu < 1/2$. Property 2 follows from the information inequality [9].
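The differentiation step behind Property 1 can be checked numerically. The sketch below assumes the transformation defined earlier in the paper, in which $p_{c,\theta}$ assigns probability $\theta$ to the label disagreeing with $c(x)$ and $1 - \theta$ to the label agreeing with it; under that assumption the expected log loss of a classifier with error rate e is $-e\log\theta - (1-e)\log(1-\theta)$. The function names and the grid search are illustrative choices only.

```python
import math

def binary_entropy(mu):
    """H(mu) in bits."""
    if mu in (0.0, 1.0):
        return 0.0
    return -mu * math.log2(mu) - (1 - mu) * math.log2(1 - mu)

def expected_log_loss(error_rate, theta):
    """E[-log2 p_{c,theta}(y|x)] when c errs with probability error_rate."""
    return -error_rate * math.log2(theta) - (1 - error_rate) * math.log2(1 - theta)

def best_theta(error_rate, grid_size=1000):
    """Grid minimizer of the expected log loss over theta in (0, 1)."""
    grid = [i / grid_size for i in range(1, grid_size)]
    return min(grid, key=lambda t: expected_log_loss(error_rate, t))
```

On this grid the minimizer coincides with the error rate itself, and the minimal value is the binary entropy of the error rate, which is why minimizing KL divergence over $(c, \theta)$ reduces to minimizing $e_D(c)$ as long as the error rate stays below 1/2.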

Proposition 1 implies that our transformation is a good candidate for turning 
classifiers into probability distributions. 

Namely, let $\mathcal{P} = \{p_\alpha : \alpha \in A\}$ be a set of i.i.d. distributions indexed by parameter set A and let $P(\alpha)$ be a prior on A. By the law of large numbers, for each $\alpha \in A$, $m^{-1}\log p_\alpha(y^m \mid x^m)P(\alpha) \to KL(p_D \| p_\alpha)$. By Bayes rule, this implies that if the class $\mathcal{P}$ is 'small' enough so that the law of large numbers holds uniformly for all $p_\alpha \in \mathcal{P}$, then for all $\epsilon > 0$, the Bayesian posterior will concentrate, with probability 1, on the set of distributions in $\mathcal{P}$ within $\epsilon$ of the $p^* \in \mathcal{P}$ minimizing KL-divergence to D. In our case, if C is 'simple' enough so that the corresponding $\mathcal{P} = \{p_{c,\theta} : c \in C, \theta \in [0, 1]\}$ admits uniform convergence [12], then the Bayesian posterior asymptotically concentrates on the $p_{c^*,\theta^*} \in \mathcal{P} = \{p_{c,\theta}\}$ closest to D in KL-divergence. By Proposition 1, this $p_{c^*,\theta^*}$ corresponds to the $c^* \in C$ with smallest generalization error rate $e_D(c^*)$ ($p_{c^*,\theta^*}$ is optimal for 0/1-loss), and to the $\theta^* \in [0,1]$ with $\theta^* = e_D(c^*)$ ($p_{c^*,\theta^*}$ gives a reliable impression of its prediction quality). This convergence to an optimal and reliable $p_{c^*,\theta^*}$ will happen if, for example, C has finite VC-dimension [12]. We can only get trouble as in Theorem 2 if we allow C to be of infinite VC-dimension.

Logistic regression interpretation. Let C be a set of functions $X \to Y$, where $Y \subseteq \mathbb{R}$ (Y does not need to be binary-valued). The corresponding logistic regression model is the set of conditional distributions $\{p_{c,\beta} : c \in C; \beta \in \mathbb{R}\}$ of the form

$$p_{c,\beta}(1 \mid x) := \frac{e^{\beta c(x)}}{1 + e^{\beta c(x)}}; \qquad p_{c,\beta}(0 \mid x) := \frac{1}{1 + e^{\beta c(x)}}. \qquad (24)$$

This is the standard construction used to convert classifiers with real-valued output such as support vector machines and neural networks into conditional distributions [14,22], so that Bayesian inference can be applied. By setting C to be a set of {0, 1}-valued classifiers, and substituting $\beta = \ln(1 - \theta) - \ln\theta$, we see that our construction is a special case of the logistic regression transformation (24). It may seem that (24) does not treat y = 1 and y = 0 on equal footing, but this is not so: we can alternatively define a symmetric version of (24) by defining, for each $c \in C$, a corresponding $c' : X \to \{-1, 1\}$, $c'(x) := 2c(x) - 1$. Then we can set

$$p_{c',\beta}(1 \mid x) := \frac{e^{\beta c'(x)}}{e^{\beta c'(x)} + e^{-\beta c'(x)}}; \qquad p_{c',\beta}(0 \mid x) := \frac{e^{-\beta c'(x)}}{e^{\beta c'(x)} + e^{-\beta c'(x)}}. \qquad (25)$$

By setting $\beta' = 2\beta$ we see that $p_{c,\beta}$ as in (24) is identical to $p_{c',\beta'}$ as in (25), so that the two models really coincide.

6 Interpretation from a Bayesian Perspective 

Bayesian Consistency. It is well-known that Bayesian inference is strongly consistent under very broad conditions. For example, when applied to our setting, the celebrated Blackwell-Dubins consistency theorem [6] says the following. Let C be countable and suppose D is such that, for some $c^* \in C$ and $\theta^* \in [0,1]$, $p_{c^*,\theta^*}$ is equal to $p_D$, the true distribution/mass function of y given x. Then with D-probability 1, the Bayesian posterior concentrates on $c^*$: $\lim_{m \to \infty} P(c^* \mid S^m) = 1$.

Consider now the learning problem underlying Theorem 2 as described in Section 4.1. Since $c_0$ achieves $\min_{c \in C} e_D(c)$, it follows by part 1 of Proposition 1 that $\min_{c,\theta} \Delta(p_{c,\theta}) = \Delta(p_{c_0, e_D(c_0)})$. If $\Delta(p_{c_0, e_D(c_0)})$ were 0, then by part 2 of Proposition 1, Blackwell-Dubins would apply, and we would have $P(c_0 \mid S^m) \to 1$. Theorem 2 states that this does not happen. It follows that the premise $\Delta(p_{c_0, e_D(c_0)}) = 0$ must be false. But since $\Delta(p_{c,\theta})$ is minimized for $(c_0, e_D(c_0))$, the Proposition implies that for no $c \in C$ and no $\theta \in [0, 1]$, $p_{c,\theta}(\cdot \mid \cdot)$ is equal to $p_D(\cdot \mid \cdot)$; in statistical terms, the model $\mathcal{P} = \{p_{c,\theta} : c \in C, \theta \in [0, 1]\}$ is misspecified.



Why is the result interesting for a Bayesian? Here we answer several 
objections that a Bayesian might have to our work. 

Bayesian inference has never been designed to work under misspecification. So 
why is the result relevant? 

We would maintain that in practice, Bayesian inference is applied all the time under misspecification in classification problems [12]. It is very hard to avoid misspecification with Bayesian classification, since the modeler often has no idea about the noise-generating process. Even though it may be known that noise is not homoskedastic, it may be practically impossible to incorporate all ways in which the noise may depend on x into the prior.



It is already well-known that Bayesian inference can be inconsistent even if $\mathcal{P}$ is well-specified, i.e. if it contains D [10]. So why is our result interesting?

The (in)famous inconsistency results by Diaconis and Freedman [10] are based on nonparametric inference with uncountable sets $\mathcal{P}$. Their theorems require that the true p has small prior density, and in fact prior mass 0 (see also [1]). In contrast, Theorem 2 still holds if we assign $p_{c_0, e_D(c_0)}$ arbitrarily large prior mass < 1, which, by the Blackwell-Dubins theorem, guarantees consistency if $\mathcal{P}$ is well-specified. We show that consistency may still fail dramatically if $\mathcal{P}$ is misspecified. This is interesting because even under misspecification, Bayes is consistent under fairly broad conditions [8,16], in the sense that the posterior concentrates on a neighborhood of the distribution that minimizes KL-divergence to the true D. Thus, we feel our result is relevant at least from the inconsistency under misspecification interpretation.

So how can our result co-exist with theorems establishing Bayesian consistency 
under misspecification? 

Such results are typically proved under either one of the following two assump- 
tions: 

1. The set of distributions $\mathcal{P}$ is 'simple', for example, finite-dimensional parametric. In such cases, ML estimation is usually also consistent; thus, for large m the role of the prior becomes negligible. In case $\mathcal{P}$ corresponds to a classification model C, this would obtain, for example, if C were finite or had finite VC-dimension.

2. $\mathcal{P}$ may be arbitrarily large or complex, but it is convex: any finite mixture of elements of $\mathcal{P}$ is an element of $\mathcal{P}$. An example is the family of Gaussian mixtures with an arbitrary but finite number of components [17].

Our setup violates both conditions: C has infinite VC-dimension, and the corresponding $\mathcal{P}$ is not closed under taking mixtures. This suggests that we could make Bayes consistent again if, instead of $\mathcal{P}$, we would base inferences on its convex closure $\bar{\mathcal{P}}$. Computational difficulties aside, this approach will not work, since the crucial part (1) of Proposition 1, on which we rely, will not hold any more: the conditional distribution in $\bar{\mathcal{P}}$ closest in KL-divergence to the true $p_D(y \mid x)$, when used for classification, may end up having larger generalization error (expected 0/1-loss) than the optimal classifier $c^*$ in the set C on which $\mathcal{P}$ was based. We will give an explicit example of this in the journal version of this paper. Thus, with a prior on $\bar{\mathcal{P}}$, the Bayesian posterior will converge, but potentially it converges to a distribution that is suboptimal in the performance measure we are interested in.







How ‘standard’ is the conversion from classifiers to probability distributions on 
which our results are based? 

One may argue that our notion of ‘converting’ classifiers into probability distri- 
butions is not always what Bayesians do in practice. For classifiers which produce 
real-valued output, such as neural networks and support vector machines, our 
transformation coincides with the logistic regression transformation, which is a 
standard Bayesian tool; see for example [14,22]. But our theorems are based on 
classifiers with 0/1-output. With the exception of decision trees, such classifiers
have not been addressed frequently in the Bayesian literature. Decision trees
have usually been converted to conditional distributions differently, by assuming 
a different noise rate in each leaf of the decision tree [13]. This makes the set of 
all decision trees on a given input space X coincide with the set of all conditional 
distributions on X , and thus avoids the misspecification problem, at the cost of 
using a much larger model space. 

Thus, this is a weak point in our analysis: we use a transformation that has
mostly been applied to real-valued classifiers, whereas our classifiers are 0/1-valued.
Whether our inconsistency results can be extended in a natural way to
classifiers with real-valued output remains to be seen. The fact that the Bayesian 
model corresponding to such neural networks will still typically be misspecified 
suggests (but does not prove) that similar scenarios may be constructed. 
Abstract. We give a new algorithm for learning intersections of halfspaces with a margin, i.e. under the assumption that no example lies too close to any separating hyperplane. Our algorithm combines random projection techniques for dimensionality reduction, polynomial threshold function constructions, and kernel methods. The algorithm is fast and simple. It learns a broader class of functions and achieves an exponential runtime improvement compared with previous work on learning intersections of halfspaces with a margin.



1 Introduction 

The Perceptron algorithm and Perceptron Convergence Theorem are among the oldest and most famous results in machine learning. The Perceptron Convergence Theorem (see e.g. [10]) states that at most $4/\rho^2$ iterations of the Perceptron update rule are required in order to correctly classify any set S of examples which are consistent with some halfspace which has margin $\rho$ on S. (Roughly speaking, this margin condition means that no example lies within distance $\rho$ of the separating hyperplane; we give precise definitions later.)

Since halfspace learning is so widely used in machine learning algorithms and 
applications, it is of great interest to develop efficient algorithms for learning in- 
tersections of halfspaces and other more complex functions of halfspaces. While 
this problem has been intensively studied, progress to date has been quite lim- 
ited; we give a brief overview of relevant previous work on learning intersections 
of halfspaces at the end of this section. 

Our results: toward Perceptron-like performance for learning intersections of halfspaces. In this paper we take a perspective similar to that of the original Perceptron Convergence Theorem by highlighting the role of the margin; our goal is to obtain results analogous to the Perceptron Convergence Theorem for learning intersections of halfspaces with margin $\rho$. (Roughly speaking, an intersection of t halfspaces has margin $\rho$ relative to a data set if each of the defining halfspaces has margin $\rho$ on the data set; we give precise definitions later.) The margin is a natural parameter to consider; previous work by Arriaga and Vempala [3] on learning intersections of halfspaces has explicitly studied the dependence on this parameter. Since the Perceptron algorithm learns a single halfspace in time $O(1/\rho^2)$, the ultimate goal in this framework would be an algorithm which can learn (say) an intersection of two halfspaces in time polynomial in $1/\rho$ as well.

Table 1. Bounds on running time for learning intersections and arbitrary functions of t halfspaces with margin $\rho$. Each $h_i$ is a halfspace over $\mathbb{R}^n$; in the second line f denotes an arbitrary Boolean function (not known a priori to the learner) on t bits. In each case the target function is assumed to have margin $\rho$.

Table 1 summarizes our main results. For any constant t = O(1) number of halfspaces (in our opinion this is the most interesting case) our learning algorithm runs in $(1/\rho)^{O(\log(1/\rho))}$ time, i.e. quasipolynomial in $1/\rho$. This is an exponential improvement over Arriaga and Vempala's previous result [3], an algorithm that runs in $(1/\rho)^{O(1/\rho^2)}$ time. (Put another way, our algorithm can learn the intersection of O(1) halfspaces with margin at least $1/2^{\sqrt{\log n}}$ in poly(n) time, whereas Arriaga and Vempala require the margin to be at least $\omega(1/\sqrt{\log n})$ to achieve poly(n) runtime.) In fact, we can learn any Boolean function of t = O(1) halfspaces, not just an intersection of halfspaces, in $(1/\rho)^{O(\log(1/\rho))}$ time.

One can instead consider the number of halfspaces t as the relevant asymp- 
totic parameter and view p as fixed at 0(1). For this case we give an algorithm 
which has a dependence on t; this algorithm can learn an intersection 

of t many halfspaces in poly(n) time. In contrast, the previous 

algorithm of [3] has a dependence on t and thus runs in poly(n) time only 

^ = Q( iogfogn ) halfspaces. 

As described below all our results are achieved using simple iterative algo- 
rithms (in fact using simple variants of the Perceptron algorithm!). 

Our Approach. Our algorithm for learning an intersection of t halfspaces in $\mathbb{R}^n$ with margin $\rho$ is given in Figure 1. The algorithm has three conceptual stages: (i) random projection, (ii) polynomial threshold function construction, and (iii) kernel methods used to learn polynomial threshold functions. We now give a brief overview of each of these stages.

Algorithm $A(EX(c, \mathcal{D}))$:

1. Let M be an n × k random projection matrix.

2. Draw m many examples from $EX(c, \mathcal{D})$ and project them to $\mathbb{R}^k$ using M.

3. Run the kernel Perceptron algorithm using the polynomial kernel $K_d(x, y) = (x \cdot y + 1)^d$ over the projected examples until a consistent hypothesis is obtained. Let h' be the kernel Perceptron hypothesis (a mapping from $\mathbb{R}^k$ to {-1, 1}).

4. Output $h : \mathbb{R}^n \to \{-1, 1\}$, $h(x) = \mathrm{sign}(h'(M^T x))$ as the final hypothesis.

Fig. 1. The algorithm is given access to a source $EX(c, \mathcal{D})$ of random labelled examples, where the target concept c is an intersection of t halfspaces over $\mathbb{R}^n$ which has margin $\rho$ with respect to distribution $\mathcal{D}$. The values of m, k and d are given in Section 6.

Random Projection: Random projection for dimensionality reduction has emerged as a useful tool in many areas of CS theory. The key fact on which most of these applications are based is the Johnson-Lindenstrauss lemma [13] which shows that a random projection of a set of m points in $\mathbb{R}^n$ into $\mathbb{R}^k$ (with $k = O(\log m / \epsilon^2)$) with high probability will not change pairwise distances by more than a $(1 \pm \epsilon)$ factor. Arriaga and Vempala [3] were the first to give learning algorithms based on random projections. Their key insight was that since the geometry of a sample does not change much under random projection, one can run learning algorithms in the low dimensional space $\mathbb{R}^k$ rather than $\mathbb{R}^n$ and thus get a computational savings.
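The four steps of Figure 1 can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the $1/\sqrt{k}$ scaling of the projection matrix, the `max_epochs` cap, the convention sign(0) = -1, and all function names are choices made here, and the paper's actual values of m, k and d (Section 6) are not used.

```python
import random

def random_projection_matrix(n, k, rng):
    """n x k matrix of independent uniform {-1, +1} entries, scaled by 1/sqrt(k) (scaling is an assumption)."""
    scale = k ** -0.5
    return [[rng.choice((-1.0, 1.0)) * scale for _ in range(k)] for _ in range(n)]

def project(M, x):
    """Return M^T x, mapping a point of R^n to R^k."""
    k = len(M[0])
    return [sum(M[i][j] * x[i] for i in range(len(x))) for j in range(k)]

def poly_kernel(u, v, d):
    """Polynomial kernel K_d(u, v) = (u . v + 1)^d."""
    return (1.0 + sum(a * b for a, b in zip(u, v))) ** d

def kernel_perceptron(points, labels, d, max_epochs=1000):
    """Kernel Perceptron; returns (predict, consistent). The epoch cap is a safeguard added for this sketch."""
    alpha = [0] * len(points)  # number of updates made on each example

    def score(z):
        return sum(a * y * poly_kernel(p, z, d)
                   for a, y, p in zip(alpha, labels, points) if a)

    def predict(z):
        return 1 if score(z) > 0 else -1

    consistent = False
    for _ in range(max_epochs):
        mistakes = 0
        for i, (p, y) in enumerate(zip(points, labels)):
            if predict(p) != y:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            consistent = True
            break
    return predict, consistent

def learn(examples, labels, k, d, seed=0):
    """Steps 1-4 of Figure 1: project, run kernel Perceptron, compose with the projection."""
    rng = random.Random(seed)
    M = random_projection_matrix(len(examples[0]), k, rng)
    projected = [project(M, x) for x in examples]
    h_prime, consistent = kernel_perceptron(projected, labels, d)
    return (lambda x: h_prime(project(M, x))), consistent

# Hypothetical two-point data set separated by a single halfspace.
h, ok = learn([[1.0, 0.0, 0.0, 0.0], [-1.0, 0.0, 0.0, 0.0]], [1, -1], k=3, d=2)
```

The hypothesis is stored implicitly through the update counts `alpha`, so the expanded feature space is never constructed explicitly.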

As described in Section 3, the first step of our algorithm is to perform a random projection of the sample from $\mathbb{R}^n$ into a lower dimensional space $\mathbb{R}^k$ where k has no dependence on n. After this projection, with high probability we have data points in $\mathbb{R}^k$ which are labelled according to some intersection of halfspaces with margin $\rho/2$.

Polynomial Threshold Functions: Constructions of polynomial threshold functions (PTFs) have recently proved quite useful in computational learning theory; for example the DNF learning algorithm of [16] has at its heart the fact that any DNF formula can be expressed as a low degree thresholded polynomial sign(p(x)). The second conceptual step of our algorithm is to construct a polynomial threshold function for an intersection of halfspaces over $\mathbb{R}^k$. We show in Section 4 that any intersection of halfspaces with margin $\rho/2$ over $\mathbb{R}^k$ can be expressed as a low-degree polynomial threshold function p over $\mathbb{R}^k$. Moreover, unlike previous analyses (which only gave degree bounds) we show that this PTF p has nonnegligible PTF margin (we define PTF margin in Section 2.2). We can thus view our projected data in $\mathbb{R}^k$ as being labelled according to some degree-d PTF over $\mathbb{R}^k$ which has nonnegligible PTF margin. (We emphasize that this is only a conceptual rather than an algorithmic step: the learning algorithm itself does not have to do anything at this stage!)

Kernel Methods: The third step is to learn the low-degree polynomial threshold function over $\mathbb{R}^k$. As shown in Section 5 we do this using the Perceptron algorithm with the standard polynomial kernel $K_d(x, y) = (1 + x \cdot y)^d$. The kernel Perceptron algorithm learns an implicit representation of a halfspace over an expanded feature space; here the expanded space has a feature for each monomial of degree up to d, and thus each example in $\mathbb{R}^k$ corresponds to a point in $\mathbb{R}^{\binom{k+d}{d}}$. We show that since there is a polynomial threshold function which correctly classifies the data in $\mathbb{R}^k$ with some PTF margin, there must be a halfspace over $\mathbb{R}^{\binom{k+d}{d}}$ which correctly classifies the expanded data with a margin, and thus we can use kernel Perceptron to learn.

Comparison with Previous Work. Many researchers have considered the problem of learning intersections of halfspaces. Efficient algorithms are known for learning intersections of halfspaces under the uniform distribution on the unit ball [7,21] and on the Boolean cube [15], but less is known about learning under more general probability distributions. Baum [4] gave an algorithm which learns an intersection of two origin-centered halfspaces under any symmetric distribution $\mathcal{D}$ (which satisfies $\mathcal{D}(x) = \mathcal{D}(-x)$ for all $x \in \mathbb{R}^n$), and Klivans et al. [15] gave a PTF-based algorithm which learns an intersection of O(1) many poly(n)-weight halfspaces over $\{0, 1\}^n$ in time under any distribution.

The most closely related previous work is that of Arriaga and Vempala [3] who gave an algorithm for learning an intersection of halfspaces with margin $\rho$; see Table 1 for a comparison with their results. Their algorithm uses random projection to reduce dimensionality and then uses a brute-force search over all (combinatorially distinct) halfspaces over the sample data. In contrast, our algorithm combines polynomial threshold functions and kernel methods with random projections, and is able to achieve an exponential runtime savings over [3].

2 Preliminaries 

2.1 Concepts and Margins 

A concept is simply a Boolean function $c : \mathbb{R}^n \to \{-1, +1\}$. A halfspace over $\mathbb{R}^n$ is a Boolean function $h : \mathbb{R}^n \to \{-1, 1\}$ defined by a vector $w \in \mathbb{R}^n$ and a value $\theta \in \mathbb{R}$; given an input $x \in \mathbb{R}^n$, the value of h(x) is $\mathrm{sign}(w \cdot x - \theta)$, i.e. h(x) = +1 if $w \cdot x \ge \theta$ and h(x) = -1 if $w \cdot x < \theta$. An intersection of t halfspaces $h_1, \ldots, h_t$ is the Boolean AND of these halfspaces, i.e. the value is 1 if $h_i(x) = 1$ for all i = 1, ..., t and is -1 otherwise.

For two vectors $x, y \in \mathbb{R}^n$ we write $\|x - y\|$ to denote the Euclidean distance between x and y and we write $S^{n-1}$ for the unit ball in $\mathbb{R}^n$. We have:

Definition 1. Given $X \subset \mathbb{R}^n$ and a concept c over $\mathbb{R}^n$, write $\|X\|$ to denote $\max_{z \in X} \|z\|$. We say that c has (geometric) margin $\rho$ with respect to X if

$$\rho = \min\{\|z - y\| : z \in X, y \in \mathbb{R}^n, c(z) \ne c(y)\} / \|X\|.$$

Our definition of the geometric margin is similar to the notion of robustness defined in Arriaga and Vempala [3]; the difference is that we normalize by dividing by the radius of the data set $\|X\|$. In the case where $\|X\| = 1$ these notions coincide and the condition is simply that for every $z \in X$, every point within a ball of radius $\rho$ around z has the same label as z under c.

For $\mathcal{D}$ a probability distribution over $\mathbb{R}^n$ we write $\mathrm{Supp}(\mathcal{D})$ to denote the set $\{x \in \mathbb{R}^n : \mathcal{D}(x) > 0\}$. We say that c has margin $\rho$ with respect to distribution $\mathcal{D}$ if c has margin $\rho$ on $\mathrm{Supp}(\mathcal{D})$. Thus, for $\mathcal{D}$ a distribution where $\mathrm{Supp}(\mathcal{D}) \subset S^{n-1}$, an intersection of t halfspaces has margin $\rho$ with respect to $\mathcal{D}$ if every point in $\mathrm{Supp}(\mathcal{D})$ lies at least distance $\rho$ away from each of the t separating hyperplanes.
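For origin-centered halfspaces on a finite point set, the quantity in the last sentence can be computed directly. The sketch below is an illustration only: it returns the smallest distance from any point to any of the separating hyperplanes, normalized by the radius of the data set, and the function names are assumptions of this example.

```python
import math

def halfspace_margin(w, points):
    """(Smallest distance from a point to the hyperplane w . x = 0) / (largest norm of a point)."""
    w_norm = math.sqrt(sum(a * a for a in w))
    min_dist = min(abs(sum(a * b for a, b in zip(w, x))) / w_norm for x in points)
    radius = max(math.sqrt(sum(b * b for b in x)) for x in points)
    return min_dist / radius

def intersection_margin(ws, points):
    """Smallest normalized distance from any point to any of the t hyperplanes."""
    return min(halfspace_margin(w, points) for w in ws)

# Hypothetical points on the unit circle and two coordinate halfspaces.
points = [(0.6, 0.8), (-0.8, 0.6)]
rho = intersection_margin([(1.0, 0.0), (0.0, 1.0)], points)
```

Every point in this toy set lies at distance at least 0.6 from both hyperplanes, so by the sentence above the intersection has margin at least 0.6 on it.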

Throughout this paper we assume that: (i) All halfspaces in our intersection of halfspaces learning problem are origin-centered, i.e. of the form $\mathrm{sign}(w \cdot x - \theta)$ with $\theta = 0$; this can be achieved by adding an (n + 1)st coordinate to each example. (ii) All examples lie on the unit ball $S^{n-1}$; this can be achieved by adding a new coordinate so that all examples have the same norm and rescaling.

2.2 Polynomial Threshold Functions and PTF Margins 

Let $f : \mathbb{R}^n \to \{-1, 1\}$ be a Boolean function and X be a subset of $\mathbb{R}^n$. A real polynomial p in n variables is said to be a polynomial threshold function (PTF) for f over X if sign(p(x)) = f(x) for all $x \in X$. The degree of a polynomial threshold function p is simply the degree of the polynomial p. Polynomial threshold functions are well studied in the case where $X = \{0, 1\}^n$ or $\{-1, 1\}^n$ (see e.g. [5,16,18,20]) but we will consider other more general subsets X.

For $S \subseteq \{x_1, \ldots, x_n\}$ a multiset of variables, we write $x_S$ to denote the monomial $\prod_{i \in S} x_i$. For p a polynomial, we write $\|p\|$ to denote the $\ell_2$ norm of the vector of coefficients of p. Given a PTF p over X, we define the PTF margin of p over X to be $\min\{|p(z)| : z \in X\}/\|p\|$. Note that if $p(x) = w \cdot x$ is a degree-1 polynomial which has $\|p\| = \sqrt{w_1^2 + \cdots + w_n^2} = 1$, then the PTF margin of p over X is equal to the geometric margin of sign(p(x)) over X (up to scaling by $\|X\|$). However in general for polynomials of degree greater than 1 these two notions are not equivalent.
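A small sketch may make the PTF margin concrete. It represents a polynomial as a dictionary from monomials (tuples of variable indices, repeated for higher powers) to coefficients; this representation and the function names are assumptions of the example, not notation from the paper.

```python
import math

def evaluate(poly, z):
    """Evaluate a polynomial given as {tuple of variable indices: coefficient}; () is the constant term."""
    return sum(coef * math.prod(z[i] for i in monomial) for monomial, coef in poly.items())

def coefficient_norm(poly):
    """l2 norm of the coefficient vector, written ||p|| in the text."""
    return math.sqrt(sum(c * c for c in poly.values()))

def ptf_margin(poly, points):
    """min |p(z)| over the points, divided by ||p||."""
    return min(abs(evaluate(poly, z)) for z in points) / coefficient_norm(poly)

# Hypothetical degree-2 polynomial p(x) = 3*x0*x1 + 4 with ||p|| = 5.
p = {(0, 1): 3.0, (): 4.0}
margin = ptf_margin(p, [(1.0, 1.0), (-1.0, 1.0)])

# Degree-1 case p(x) = x0 with ||p|| = 1 on points of the unit circle.
linear_margin = ptf_margin({(0,): 1.0}, [(0.6, 0.8), (-0.8, 0.6)])
```

In the degree-1 case the value agrees with the distance of the closest point to the hyperplane, matching the remark above about geometric margin.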

2.3 The Perceptron Algorithm and Kernel Perceptron 

Perceptron is a simple iterative algorithm which finds a linear separator for a labelled data set $X \subset \mathbb{R}^n$ if such a separator exists. The algorithm maintains a weight vector $w \in \mathbb{R}^n$ and a bias $\theta \in \mathbb{R}$ and updates these parameters additively after each example; see e.g. Chapter 2 of [10] for details. The Perceptron Convergence Theorem bounds the number of updates in terms of the maximum margin of any halfspace (the following is adapted from Theorem 2.3 of [10]):

Theorem 1. Let $X \subset \mathbb{R}^n$ be a set of labelled examples such that there is some halfspace h (which need not be origin-centered) which has margin $\rho$ over X. Then the Perceptron algorithm makes at most $4/\rho^2$ mistakes on X.

Let $\phi : \mathbb{R}^n \to \mathbb{R}^N$ be a function which we call a feature expansion. We refer to $\mathbb{R}^n$ as the original feature space and $\mathbb{R}^N$ as the expanded feature space. The kernel corresponding to $\phi$ is the function $K(x, y) = \phi(x) \cdot \phi(y)$. The use of kernels in machine learning has received much research attention in recent years (see e.g. [10,12] and references therein).

Given a data set $X \subset \mathbb{R}^n$, it is well known (see e.g. [11]) that the Perceptron algorithm can be simulated over $\phi(X)$ in the expanded feature space $\mathbb{R}^N$ using the kernel function K(x, y) to yield an implicit representation of a halfspace in $\mathbb{R}^N$. If evaluating K(x, y) takes time T and the Perceptron algorithm is simulated until M mistakes are made on a data set X with |X| = m, the time required is $O(mTM^2)$ (see e.g. [12,14]).
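To illustrate the relation $K(x, y) = \phi(x) \cdot \phi(y)$, the sketch below writes out one explicit feature expansion for the polynomial kernel $(1 + x \cdot y)^2$ on $\mathbb{R}^2$. The particular scaling of the features by $\sqrt{2}$ is a standard choice made for this example; the paper does not spell out a specific $\phi$.

```python
import math

def poly_kernel(x, y, d):
    """Polynomial kernel (1 + x . y)^d."""
    return (1.0 + sum(a * b for a, b in zip(x, y))) ** d

def phi_degree2(x):
    """Explicit feature expansion on R^2 with phi(x) . phi(y) = (1 + x . y)^2."""
    x1, x2 = x
    r = math.sqrt(2.0)
    return [1.0, r * x1, r * x2, x1 * x1, x2 * x2, r * x1 * x2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (0.5, -1.0), (2.0, 0.25)
implicit = poly_kernel(x, y, 2)                    # one kernel evaluation in R^2
explicit = dot(phi_degree2(x), phi_degree2(y))     # inner product in the 6-dimensional expanded space
```

The expanded space here has 6 coordinates, one per monomial of degree at most 2 in two variables, while the kernel evaluation never leaves $\mathbb{R}^2$.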

3 Random Projections 

We say that an n × k matrix M is a random projection matrix if each entry of M is chosen independently and uniformly from {-1, 1}. We will use the following lemma from Arriaga and Vempala [3] (see Achlioptas [1] for similar results):

Lemma 1. [3] Fix p<l, 0<c<| and w G R” with ||rt;|| = 1. Let M be an 
n X k random projection matrix. For any x G R" we have 

Pr[w • a; - 2c < (M^w) • {M^x) <w-x + 2c]>l- > 1 - 

With this lemma in hand we can establish the main theorem on random 
projection which we will use: 

Theorem 2. Let X he a set of m points on and let h = sign{w ■ x) he a 

half space which has margin p on X. Let k > log(i^) and let M be a nx k 

random projection matrix. Let M{X) C R^ denote the projection of X under M 
and let h' : R^ — >■ {— 1,+1| denote the function h'{y) = sign{{M'^w) ■ y). Then 
with probability 1 — <5, the halfspace h' correctly classifies M{X) with margin at 
least I and we have | < ||M(A)|| < 2. 

Proof. We may assume that ||w|| = 1. After applying M to the points in X, we 
need to verify that Definition 1 is satisfied for h' with respect to the points in 
M{X). Setting c = ^ and setting k as above, taking a: = re in Lemma 1 we 
have that with probability at least 1 — ||M^w|p < Ijwjp + | = 1 + |, so 

||M^u;||<l+^. 

Now for each point z G X, applying Lemma 1 with x = z, with proba- 
bility at least 1 — ^ we have {w ■ z) — ^ < (M'^w) ■ (M'^z) < {w ■ z) + ^. 
Since |(w • z)| > p, this gives \{M'^w) ■ (M^z)| > Hence with probability 
at least 1 — | we have min{||z' — y\\ : z' G M{X),y G R^,/i'(z') h'{y)} > 

min^gjf \{M'^w) ■ {M"’" z)\/\\M'^w\\ > Lemma 1 similarly implies 

that 1— I < ||M(A)|| < 1+^ with probability at least 1 — f - Thus with 
probability 1 — 5, h' has margin at least | on M{X) and | < ||M(x)|| < 2. □ 

A union bound yields the following corollary: 

Corollary 1. Let X he a set of m points on 5'"“^ and let H = Ai=i = 
sign{w^ ■ x)A . . .A sign(w* ■ x) be an intersection of t halfspaces which has margin 
p on X. Let k > • log( ^^™* ) and let M be a n x k random projection 

matrix. Let M{X) C R^ denote the projection of X under M and let H' = 
Ai=i sign{{M'^ w‘^) -y). Then with probability 1 — <5, the intersection of halfspaces 
H' correctly classifies M{X) with margin at least ^ and ^ < ||M(A)|| < 2. 
Thus with high probability the projected set of examples in $\mathbb{R}^k$ is classified
by an intersection of halfspaces with margin | . It is easy to see that the corollary 
in fact holds for any Boolean function (not just intersections) of t halfspaces. 

4 Polynomial Threshold Functions for Intersections of 
Halfspaces with a Margin 

In this section we give several constructions of polynomial threshold functions 
for intersections of halfspaces with a margin. In each case we give a PTF and also 
a lower bound on the PTF margin of the polynomial threshold function which 
we construct. These PTF margin lower bounds will be useful when we analyze 
the performance of kernel methods for learning polynomial threshold functions. 

In order to lower bound the PTF margin of a polynomial $p$ we must upper
bound $\|p\|$. Fact 3 helps obtain such upper bounds:

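Fact 3. 1. For $i = 1,\dots,\ell$ let $q_i(x) = \sum_S c^i_S x_S$ be a degree-$d$ polynomial over
$x_1,\dots,x_k$ with $\|q_i\|^2 \le M$. Then $\|q_1(x) \cdots q_\ell(x)\|^2 \le M^\ell k^{d\ell}$.

2. For $q_1,\dots,q_\ell$ with $\|q_i\|^2 \le M_i$, we have $\|q_1 + \cdots + q_\ell\|^2 \le \ell(M_1 + \cdots + M_\ell)$.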

4.1 Constructions Based on Rational Functions 

Recall that a rational function is a quotient of two real polynomials, i.e. $Q(x) =
a(x)/b(x)$. The degree of $Q$ is defined as $\deg(a) + \deg(b)$. Building on results of
Newman [17] on rational functions which approximate the function $|x|$, in [6]
Beigel et al. gave a construction of a low-degree rational function which closely
approximates the function $\mathrm{sgn}(x)$. We will use the following (Lemma 9 of [6]):

Lemma 2. [6] For all integers $r, \ell \ge 1$ there is a univariate rational function
$P^r_\ell(x) = a(x)/b(x)$ of degree $O(\ell \log r)$ with the following properties:
(i) $P^r_\ell(x) \in [1, 1 + \frac{1}{r}]$ for all $x \in [1, 2^\ell]$; (ii) $P^r_\ell(x) \in [-1 - \frac{1}{r}, -1]$ for all $x \in [-2^\ell, -1]$; and
(iii) each coefficient of $a(x), b(x)$ has magnitude at most $2^{O(\ell \log r)}$.

The following theorem generalizes Theorem 24 in [15], which addresses the
special case of intersections of low-weight halfspaces over the space $X = \{0,1\}^n$:

Theorem 4. Let $X$ be a subset of $\mathbb{R}^k$ with $\frac{1}{2} \le \|X\| \le 2$ and $c : \mathbb{R}^k \to \{-1,1\}$
be an intersection of $t$ origin-centered halfspaces $h_1,\dots,h_t$. If $c$ has margin $\rho$ on
$X$ then there is a polynomial threshold function of degree $d = O(t \log t \log \frac{1}{\rho})$ for
$c$ on $X$. If $d < k$ then this PTF has PTF margin $(\rho/k)^{O(t \log t \log 1/\rho)}$ on $X$.

Proof. We must exhibit a polynomial $p(x)$ of the claimed degree such that for
any $z \in X$ we have $\mathrm{sign}(p(z)) = c(z)$ and $|p(z)|/\|p\| \ge (\rho/k)^{O(t \log t \log 1/\rho)}$.

^ Because of space restrictions all appendices are omitted in this version; see 
http://www.cs.columbia.edu/~'rocco/p6 Jong. pdf for the full version. 
Let $w^1 \cdot x = 0, \dots, w^t \cdot x = 0$ be the $t$ hyperplanes which define halfspaces
$h_1,\dots,h_t$; we may assume without loss of generality that each $\|w^i\| = 1$. Now
consider the sum of rational functions

Q{^) = ^log4/p(2(t«' • x)/p) + • • • + • x)/p) - t + 1/2. 

Fix any z G X. Since c has margin p on X and ^ < ||X|| < 2, for each i = 1, . . . ,t 
we have ^ < p\\X\\ < \w^ ■ z\ < |l'u;*|| • ||X|| < 2 and hence |2(w* • z)/p\ G [1, ^]. 

Consequently Piog 4 /p C^^p^'^ ) in [1, 1+^] if /i*(z) = 1 and lies in [—1—^, —1] 
if hi{z) = —1. Thus if hi{z) = 1 for all i we have Q{z) >t — t+^ = ^, and if 
hi{z) = —1 for some i we have Q(z) < — 1 + (t — 1) + — t + ^ So 

sign(Q( 2 ;)) = c{z) for all z G X. 

Since $Q(x)$ is a sum of $t$ rational functions of degree $O(\log t \log \frac{1}{\rho})$, we can
clear denominators and re-express $Q(x)$ as a single rational function $A(x)/B(x)$
of degree $O(t \log t \log \frac{1}{\rho})$. It follows that the function $p(x) = A(x)B(x)$, which is
a polynomial of degree $O(t \log t \log \frac{1}{\rho})$, has $\mathrm{sign}(p(z)) = \mathrm{sign}(Q(z))$ as desired
(since $A(z)B(z) = Q(z)B(z)^2$ and $B(z)^2 > 0$).

Now we must bound ||p||. We have || ^ so by part (1) of Fact 

3 we have that < (||)^ for all j. By Lemma 2 we have that 

-^iog 4 /p(^) = where a{x),b{x) are polynomials of degree O(logtlog^) with 
coefficients of magnitude at most = (L)C)('°g‘i°gi/c). It follows 

from part (2) of Fact 3 that ||a(^^)||2 < (|)0(iogii°gi/p).(i)0(iogtiogi/p) 
equals (^h'jO(iogtiogi/p) ^ same holds for ||6( 2™^*'^ )|p. Expressing Q{x) as 

a rational function A{x)/B{x), we have that B{x) = ]/[i=i H ^^p so since 
d<k part (1) of Fact 3 implies that \\B{x)f < fcO(tiogiiogi/p)(^)0(tiogtiogi/p) 
_ ^^^o(tiogtiogi/p)^ Simple calculations using part (1) of Fact 3 show that 

||Gl(a;)||2 and ||p(a;)|| = |l^(a;)i?(x)|| are also (/c/p)'^(*'°g‘i°gi/p), and we are done. 

□ 
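The sign argument in the proof of Theorem 4 depends only on how accurately each summand approximates $\pm 1$, not on the particular rational function used. The sketch below makes this concrete. It is an illustration only: `stand_in_sign` is a hypothetical stand-in that merely satisfies range conditions of the kind stated in Lemma 2 (within $1/(2t)$ of the correct sign whenever its argument has magnitude at least 1); it is not the rational function of [6], and the accuracy parameter is an assumption.

```python
import math


def stand_in_sign(u: float, t: int) -> float:
    """Hypothetical stand-in for the sign approximator.

    Returns a value in [1, 1 + 1/(2t)] when u >= 1 and in
    [-1 - 1/(2t), -1] when u <= -1.
    """
    if abs(u) < 1.0:
        raise ValueError("argument must satisfy |u| >= 1")
    return math.copysign(1.0 + 1.0 / (2 * t * abs(u)), u)


def thresholded_sum(normals, rho: float, z) -> float:
    """Q(z) = sum_i s(2 (w^i . z) / rho) - t + 1/2 for unit normals w^i."""
    t = len(normals)
    total = 0.0
    for w in normals:
        u = 2.0 * sum(wi * zi for wi, zi in zip(w, z)) / rho
        total += stand_in_sign(u, t)
    return total - t + 0.5


def intersection_label(normals, z) -> int:
    """+1 if z lies on the positive side of every halfspace, else -1."""
    return 1 if all(sum(wi * zi for wi, zi in zip(w, z)) > 0 for w in normals) else -1


s = 1.0 / math.sqrt(2.0)
normals = [(1.0, 0.0), (0.0, 1.0), (s, s)]
rho = 0.5
# Every point satisfies |w^i . z| >= rho / 2 for each normal and has norm at most 2.
points = [(0.5, 0.5), (1.0, 1.0), (-0.5, 1.0), (1.0, -0.3), (-1.0, -1.0)]
```

If every halfspace is satisfied, each summand is at least 1 and the sum is at least $1/2$; if at least one is violated, the sum is at most $-1 + (t-1)(1 + \frac{1}{2t}) - t + \frac{1}{2} < 0$.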

By modifying this construction, we get a polynomial threshold function for 
any Boolean function of t halfspaces rather than just an intersection (at a rela- 
tively small cost in degree and PTF margin): 

Theorem 5. Let $f : \{-1,1\}^t \to \{-1,1\}$ be any Boolean function on $t$ bits.
Let $X$ be a subset of $\mathbb{R}^k$ with $\frac{1}{2} \le \|X\| \le 2$ and $c : \mathbb{R}^k \to \{-1,1\}$ be the
function $f(h_1,\dots,h_t)$ where $h_1,\dots,h_t$ are origin-centered halfspaces in $\mathbb{R}^k$. If $c$
has margin $\rho$ on $X$ then there is a PTF of degree $d = O(t^2 \log \frac{1}{\rho})$ for $c$ on $X$. If
$d < k$ then this PTF has PTF margin $(\rho/k)^{O(t^2 \log 1/\rho)}$ on $X$.

Proof. As before, we give a polynomial $p(x)$ of the claimed degree such that for
any $z \in X$ we have $\mathrm{sign}(p(z)) = c(z)$ and $|p(z)|/\|p\| \ge (\rho/k)^{O(t^2 \log 1/\rho)}$.

Again let $w^1 \cdot x = 0, \dots, w^t \cdot x = 0$ be the hyperplanes for halfspaces $h_1,\dots,h_t$,
where each $w^i$ is a unit vector. For each $i = 1,\dots,t$ consider the rational function

= PCgA/p (2(w* • x)/p) . 
Fix any $z \in X$. As before we have that $|2(w^i \cdot z)/\rho| \in [1, \frac{4}{\rho}]$, so by Lemma
2 the value of Qi{z) differs from the ±1 value hi{z) = sign(w* • z) by at most 
Since $f$ is a Boolean function on $t$ inputs, it is expressible as a multilinear
polynomial $\tilde{f}$ of degree $t$, with coefficients of the form $i/2^t$ where $i$ is an integer
in $[-2^t, 2^t]$. (The polynomial $\tilde{f}$ is just the Fourier representation of $f$.) Multiply
$\tilde{f}$ by $2^t$, so now $\tilde{f} : \{+1,-1\}^t \to \{+2^t, -2^t\}$, and $\tilde{f}$ has integer coefficients which
are at most $2^t$ in absolute value.
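The multilinear (Fourier) representation used here can be computed by brute force for small $t$. The following sketch is illustrative; the function names and the choice of AND on three bits as the example are assumptions, and exact rational arithmetic is used so that the claim about coefficients of the form $i/2^t$ can be checked directly.

```python
from fractions import Fraction
from itertools import combinations, product


def fourier_coefficients(f, t: int):
    """Map each subset S of {0..t-1} to the coefficient of prod_{i in S} x_i."""
    inputs = list(product((-1, 1), repeat=t))
    coeffs = {}
    for size in range(t + 1):
        for S in combinations(range(t), size):
            total = 0
            for x in inputs:
                chi = 1
                for i in S:
                    chi *= x[i]
                total += f(x) * chi
            coeffs[S] = Fraction(total, 2 ** t)  # average of f(x) * chi_S(x)
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the multilinear polynomial at a point x."""
    value = Fraction(0)
    for S, c in coeffs.items():
        term = c
        for i in S:
            term *= x[i]
        value += term
    return value


def and_function(x) -> int:
    """AND on +-1 inputs: +1 only when every input is +1."""
    return 1 if all(xi == 1 for xi in x) else -1


t = 3
coeffs = fourier_coefficients(and_function, t)
scaled = {S: c * 2 ** t for S, c in coeffs.items()}  # integer coefficients of 2^t * f
```

For AND on three bits the constant coefficient is $-3/4$, since the function equals $+1$ on one input and $-1$ on the other seven.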

Now we would like to argue that f{Qi{z), . . . ,Qt{z)) has the same sign as 
f{hi{z), . . . , ht{z)). To do this we show that the “error” of each Qi{z) relative 
to the ±1 value hi{z) (which error is at most does not cause / to have the 
wrong sign. The polynomial / has at most 2* terms, each of which is the product 
of an integer coefficient of magnitude at most 2* and up to t of the Qi’s. The 
product of the Qi’s incurs error at most 0(t2“^*) relative to the corresponding 
product of the hi’s, and thus the error of any given term (including the integer 
coefficient) is at most 0(t2“^*). Since we add up at most 2* terms, the overall 
error is at most 0{t2~*) error, which is much less than what we could tolerate 
(we could tolerate error 2*; recall that / takes value ±2* on ±1 inputs). Thus 
f{Qi{z ), . . . , Qt{z)) has the same sign as f{hi{z ), . . . , ht{z)) for all z £ X. 

Now $\tilde{f}$ is a multilinear polynomial of degree $t$, and each $Q_i$ is a rational
function of degree $O(t \log \frac{1}{\rho})$. We can bring $\tilde{f}(Q_1,\dots,Q_t)$ to a common denominator
(which is the product of the denominators of the $Q_i$'s) of degree $O(t^2 \log \frac{1}{\rho})$.
Hence we have a single multivariate rational function $A(x)/B(x)$ which takes
the right sign on $X$, and we can convert this rational function to a polynomial
threshold function $p(x) = A(x)B(x)$ as in the proof of Theorem 4.

Now we must bound ||p||. Let Qi{x) = . The analysis from the previ- 

ous proof implies that ||oi(a;)|p and ||6i(x)|p are both at most (^)'^(*'°8i/p). 
Now consider a monomial (in the “variables” Qi{x), . . . ,Qt{x)) in the poly- 
nomial f{Qi{x),...,Qt{x)). Since the numerator a{x) of such a monomial is 
the product of at most t of the Oi(x)’s, and each ai{x) has degree at most 
0(logtlog i), the fact that d < k and part (1) of Fact 3 together give ||a(x)|p < 

;,0(i log t log i/p)(-y)0(F log i/p) ('^)0(Fiogi/p) 

the denominator P{x) of such a monomial. Since the common denomiator for 
/(Qi, . . . , Qt) is the product of the denominators of the Qi’s, clearing all denomi- 
nators we have that /(Qi, . . . , Qt) = A{x)/B{x) with ||A(a;)|p and ||H(a;)|p both 
at most (^)0(‘'i°si/p). We thus have \\p{x)f = \\A{x)B{x)f = (^)0(i"iogi/p) 
and the theorem is proved. □ 



4.2 Constructions Using Extremal Polynomials 

The bounds from the previous section are quite strong when $t$ is relatively small.
If $t$ is large but $\rho$ is also quite large, then the following bounds based on
Chebyshev polynomials are better.

The $r$-th Chebyshev polynomial of the first kind, $T_r(x)$, is a univariate
degree-$r$ polynomial with the following properties [9]:
Lemma 3. The polynomial $T_r(x) = \sum_{i=0}^{r} a_i x^i$ satisfies: (i) $|T_r(x)| \le 1$ for
$|x| \le 1$ with $T_r(1) = 1$; (ii) $T_r'(x) \ge r^2$ for $x \ge 1$ with $T_r'(1) = r^2$; and (iii) for
$i = 0,\dots,r$ each $a_i$ is an integer with $|a_i| \le 2^r$.
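The properties in Lemma 3 are easy to check numerically for a given $r$. The sketch below builds the coefficients from the standard recurrence $T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x)$; the function names are chosen for this example only.

```python
def chebyshev_coefficients(r: int):
    """Integer coefficients a_0..a_r of T_r, lowest degree first."""
    prev, cur = [1], [0, 1]  # T_0 = 1, T_1 = x
    if r == 0:
        return prev
    for _ in range(r - 1):
        nxt = [0] + [2 * c for c in cur]  # 2x * T_n
        for i, c in enumerate(prev):      # subtract T_{n-1}
            nxt[i] -= c
        prev, cur = cur, nxt
    return cur


def evaluate(coeffs, x: float) -> float:
    """Horner evaluation of a polynomial given lowest-degree-first coefficients."""
    value = 0.0
    for c in reversed(coeffs):
        value = value * x + c
    return value


def derivative(coeffs):
    """Coefficients of the derivative polynomial."""
    return [i * c for i, c in enumerate(coeffs)][1:]


r = 5
T5 = chebyshev_coefficients(r)  # T_5(x) = 16x^5 - 20x^3 + 5x
grid = [i / 500.0 - 1.0 for i in range(1001)]  # sample points in [-1, 1]
max_on_interval = max(abs(evaluate(T5, x)) for x in grid)
slope_at_one = evaluate(derivative(T5), 1.0)
```

For the small values of $r$ checked here every coefficient has magnitude at most $2^r$. For larger $r$ an individual coefficient can slightly exceed that value (the $x^8$ coefficient of $T_{10}$ is $-1280$), while the weaker bound $|a_i| \le 3^r$ holds for every $r$; using it would change only the constants hidden in the $O(\cdot)$ exponents of the norm bounds.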

The following theorem generalizes results in [16]: 

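Theorem 6. Let $X$ be a subset of $\mathbb{R}^k$ with $\frac{1}{2} \le \|X\| \le 2$ and let $c : \mathbb{R}^k \to
\{-1,1\}$ be an intersection of $t$ origin-centered halfspaces $h_1,\dots,h_t$. If $c$ has
margin $\rho$ on $X$ then there is a PTF of degree $d = O(\sqrt{1/\rho}\,\log t)$ for $c$ on $X$. If
$d < k$ then this PTF has PTF margin $1/k^{O(\sqrt{1/\rho}\,\log t)}$ on $X$.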

Proof: As in the previous proofs we must exhibit a polynomial $p(x)$ such that
for any $z \in X$ we have $\mathrm{sign}(p(z)) = c(z)$ and $|p(z)|/\|p\| \ge 1/k^{O(\sqrt{1/\rho}\,\log t)}$.

Let $w^1 \cdot x = 0, \dots, w^t \cdot x = 0$ be the $t$ hyperplanes for halfspaces $h_1,\dots,h_t$
where each $\|w^i\| = 1$. Let $P$ be the univariate polynomial $P(x) = T_r(1 - x)$ where
$r = \lceil \sqrt{2/\rho} \rceil$. The first part of Lemma 3 implies that $|P(x)| \le 1$ for $x \in [0,2]$, and
the second part implies that $P(x) \ge 2$ for $x \le -\rho/2$. Now consider the polynomial
threshold function $\mathrm{sign}(p(x))$ where

$$p(x) = t + \frac{1}{2} - \sum_{i=1}^{t} P(w^i \cdot x)^{\lceil \log 2t \rceil}.$$

Since $P$ is a polynomial of degree $r = \lceil \sqrt{2/\rho} \rceil$ and $w^i \cdot x$ is a polynomial of
degree 1, this polynomial threshold function has degree $d = \lceil \sqrt{2/\rho} \rceil \cdot \lceil \log 2t \rceil$.
We now show that $p(x)$ has the desired properties described above.

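We first show that for any $z \in X$ the polynomial $p$ takes the right sign
and has magnitude at least $\frac{1}{2}$. Fix any $z \in X$. For each $i = 1,\dots,t$ we have
$\frac{\rho}{2} \le \rho\|X\| \le |w^i \cdot z| \le \|w^i\| \cdot \|X\| \le 2$.

- If $c(z) = 1$ then for each $i$ we have $\frac{\rho}{2} \le w^i \cdot z \le 2$ and hence we have that
  $P(w^i \cdot z)$ (and also $P(w^i \cdot z)^{\lceil \log 2t \rceil}$) lies in $[-1,1]$. Consequently we have that
  $p(z) \ge t + \frac{1}{2} - t \ge \frac{1}{2}$ so $\mathrm{sign}(p(z)) = c(z) = 1$.

- If $c(z) = -1$ then for some $i$ we have $w^i \cdot z \in [-2, -\frac{\rho}{2}]$, so consequently
  $P(w^i \cdot z) \ge 2$ and $P(w^i \cdot z)^{\lceil \log 2t \rceil} \ge 2t$. Since $P(w^j \cdot z)^{\lceil \log 2t \rceil} \ge -1$ for all $j$,
  we have $p(z) \le t + \frac{1}{2} - 2t + (t - 1) = -\frac{1}{2}$ so $\mathrm{sign}(p(z)) = c(z) = -1$.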

To finish the proof it remains to bound $\|p\|$. Since $\|w^i \cdot x\|^2 = 1$ for all $i$, by
part 2 of Fact 3 we have $\|1 - w^i \cdot x\|^2 \le 4$, so by part 1 of Fact 3 we have that
$\|(1 - w^i \cdot x)^j\|^2 \le (4k)^j$ for $j = 0,\dots,r$. Since (by Lemma 3) $T_r(x) = \sum_{j=0}^{r} a_j x^j$
where each $|a_j| \le 2^r$, for each $j = 0,\dots,r$ we have $\|a_j(1 - w^i \cdot x)^j\|^2 \le 2^{2r}(4k)^r$.
By part 2 of Fact 3 we obtain $\|T_r(1 - w^i \cdot x)\|^2 \le (r+1)^2(16k)^r$, and now part
1 implies that $\|P(w^i \cdot x)^{\lceil \log 2t \rceil}\|^2 = k^{O(r \log t)}$. Using part 2 again we obtain that
$\|p\| = k^{O(r \log t)}$, and the theorem is proved. □
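The construction in this proof can be tried on a small example. The sketch below assumes the polynomial has the form $p(x) = t + \frac{1}{2} - \sum_i P(w^i \cdot x)^{\lceil \log_2 2t \rceil}$ with $P(x) = T_r(1-x)$ and $r = \lceil\sqrt{2/\rho}\rceil$, which is the form implied by the two cases analysed above; the function names, the two halfspaces and the sample points are assumptions made for the illustration.

```python
import math


def chebyshev(r: int, x: float) -> float:
    """Evaluate T_r(x) by the recurrence T_{n+1} = 2x T_n - T_{n-1}."""
    if r == 0:
        return 1.0
    prev, cur = 1.0, x
    for _ in range(r - 1):
        prev, cur = cur, 2.0 * x * cur - prev
    return cur


def intersection_ptf(normals, rho: float):
    """Return p with sign(p(z)) = +1 iff w . z > 0 for every unit normal w."""
    t = len(normals)
    r = math.ceil(math.sqrt(2.0 / rho))
    exponent = math.ceil(math.log2(2 * t))

    def p(z) -> float:
        total = 0.0
        for w in normals:
            u = sum(wi * zi for wi, zi in zip(w, z))
            total += chebyshev(r, 1.0 - u) ** exponent  # P(w . z)^ceil(log 2t)
        return t + 0.5 - total

    return p


normals = [(1.0, 0.0), (0.0, 1.0)]
rho = 0.5
p = intersection_ptf(normals, rho)
# Each point has norm in [1/2, 2] and satisfies |w . z| >= rho / 2 for both normals.
labelled_points = [
    ((0.5, 0.5), 1),
    ((-0.5, 0.5), -1),
    ((1.0, -1.0), -1),
    ((-1.0, -1.0), -1),
]
```

Here $r = 2$ and the exponent is 2, so $p$ has degree 4; for the point $(0.5, 0.5)$ both summands equal $T_2(0.5)^2 = 0.25$ and $p = 2.5 - 0.5 = 2$.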

As Arriaga and Vempala observed in [3], DNF formulas can be viewed as 
unions of halfspaces. If we rescale the cube so that it is a subset of it is 

easy to check that a Boolean function $f : \{-1,1\}^k \to \{-1,1\}$ has margin $\rho$ with
respect to $X \subset \{-1,1\}^k$ if for every $z \in X$ we have that every Boolean string $z'$

2 

which differs from z in at most a ^ fraction of bits has f{z') = f{z). 

Since any DNF formula with $t$ terms can be expressed as a union of $t$ halfspaces,
we have the following corollary of Theorem 6:

Corollary 2. Let $X \subset \{-1,1\}^k$ and let $c$ be a $t$-term DNF formula on $k$ variables.
If $c$ has margin $\rho$ on $X$ then there is a polynomial threshold function of
degree $d = O(\sqrt{1/\rho}\,\log t)$ for $c$ on $X$.
If $d < k$ then this PTF has PTF margin $1/k^{O(\sqrt{1/\rho}\,\log t)}$ on $X$.

A similar corollary for DNF formulas also follows from Theorem 4 but we 
are most interested in DNFs with t =poly(n) terms so we focus on Theorem 6. 

5 Kernel Perceptron for learning PTFs with PTF Margin 

In this section we first define a new kernel, the Complete Symmetric Kernel, 
which arises naturally in the context of polynomial threshold functions. We give 
an efficient algorithm for computing this kernel (which may be of independent 
interest), and indeed all results of the paper could be proved using this new 
kernel. To make our overall algorithm simpler, however, we ultimately use the 
standard polynomial kernel which we discuss later in this section. 

Let $\phi_d : \mathbb{R}^k \to \mathbb{R}^{\binom{k+d}{d}}$ be a feature expansion which maps $(x_1,\dots,x_k)$ to
the vector $(1, x_1, \dots, x_k, x_1^2, x_1 x_2, \dots)$ containing all monomials of degree up to
$d$. Let $K_d(x,y) = \phi_d(x) \cdot \phi_d(y)$ be the kernel corresponding to $\phi_d$. We refer to
$K_d(x,y)$ as the complete symmetric kernel since as explained in Appendix B the
value $K_d(x,y)$ equals the sum of certain complete symmetric polynomials.
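For small $k$ and $d$ the feature expansion can be written out explicitly, which gives a brute-force way to evaluate the kernel. The sketch below is only a definition-level illustration (its cost grows with $\binom{k+d}{d}$) and is not the efficient algorithm referred to in the text; the function names are chosen for this example.

```python
from itertools import combinations_with_replacement
from math import comb, prod


def monomial_features(x, d: int):
    """All monomials of degree at most d in the coordinates of x."""
    features = []
    for degree in range(d + 1):
        # Each multiset of indices of size `degree` names one monomial.
        for indices in combinations_with_replacement(range(len(x)), degree):
            features.append(prod(x[i] for i in indices))
    return features


def complete_symmetric_kernel(x, y, d: int):
    """Brute-force K_d(x, y) as an explicit inner product of monomial vectors."""
    return sum(a * b for a, b in zip(monomial_features(x, d), monomial_features(y, d)))


k, d = 3, 2
expanded = monomial_features([1, 2, 3], d)
value = complete_symmetric_kernel([1, 2], [3, 4], 2)
```

For $x = (1,2)$ and $y = (3,4)$ with $d = 2$ the monomial vectors are $(1,1,2,1,2,4)$ and $(1,3,4,9,12,16)$, whose inner product is 109.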

For a data set $X \subset \mathbb{R}^k$ we write $\phi_d(X)$ to denote the expanded data set of
points in $\mathbb{R}^{\binom{k+d}{d}}$. The following lemma gives a mistake bound for the Perceptron
algorithm using the complete symmetric kernel:

Lemma 4. Let $X \subset \mathbb{R}^k$ be a set of labelled examples such that there is some
degree-$d$ polynomial threshold function $p(x)$ which correctly classifies $X$ and has
PTF margin $\rho$ over $X$. Then the Perceptron algorithm (run on $\phi_d(X)$ using the
complete symmetric kernel Kd) makes at most mistakes on X. 

(k+d\ 

Proof. The vector IF G R1 1 whose coordinates are the coefficients of p has 
O’^er </>d(X). Since W ■ (j)d{z) = p{z) and ||IF|| = ||p||, the 
lemma follows from the definition of the PTF margin of $p$ and the Perceptron
Convergence Theorem (Theorem 1). □

In the full version of this paper (available on either author’s web page) we 
give a polynomial time algorithm for computing Kd{x,y), but this algorithm is 
somewhat cumbersome. With the aim of obtaining a faster and simpler overall 
algorithm, we now describe an alternate approach based on the well known 
polynomial kernel. 




Learning Intersections of Halfspaces with a Margin 359 



As in [10], we define the degree-$d$ polynomial kernel $K'_d : \mathbb{R}^k \times \mathbb{R}^k \to \mathbb{R}$ as
$K'_d(x,y) = (1 + x \cdot y)^d$. It is clear that $K'_d(x,y)$ can be computed efficiently. Let
$\phi'_d : \mathbb{R}^k \to \mathbb{R}^{\binom{k+d}{d}}$ be the feature expansion such that $K'_d(x,y) = \phi'_d(x) \cdot \phi'_d(y)$;
note that $\phi'_d(x)$ differs from $\phi_d(x)$ defined above because of the coefficients that
arise in the expansion of $(1 + x \cdot y)^d$.
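The coefficients in question come from the multinomial theorem: $(1 + x \cdot y)^d = \sum_{|\alpha| \le d} \frac{d!}{(d-|\alpha|)!\,\alpha_1! \cdots \alpha_k!}\, x^\alpha y^\alpha$, so one valid feature expansion scales each monomial $x^\alpha$ by the square root of its multinomial coefficient. The sketch below checks this identity numerically; the function names are assumptions for the example, and this particular normalization of the feature map is one natural choice rather than something fixed by the text.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial, isclose, prod, sqrt


def polynomial_kernel(x, y, d: int) -> float:
    """K'_d(x, y) = (1 + x . y)^d."""
    return (1.0 + sum(a * b for a, b in zip(x, y))) ** d


def multinomial_coefficients(k: int, d: int):
    """Coefficient of each monomial of degree <= d in the expansion of (1 + x . y)^d."""
    coeffs = []
    for degree in range(d + 1):
        for indices in combinations_with_replacement(range(k), degree):
            multiplicities = Counter(indices).values()
            denom = factorial(d - degree) * prod(factorial(m) for m in multiplicities)
            coeffs.append((indices, factorial(d) // denom))
    return coeffs


def polynomial_features(x, d: int):
    """Monomials of degree <= d, each scaled by sqrt of its multinomial coefficient."""
    return [
        sqrt(c) * prod(x[i] for i in indices)
        for indices, c in multinomial_coefficients(len(x), d)
    ]


x, y, d = [1.0, 2.0], [3.0, 4.0], 3
implicit = polynomial_kernel(x, y, d)  # (1 + 11)^3
explicit = sum(a * b for a, b in zip(polynomial_features(x, d), polynomial_features(y, d)))
```

Every multinomial coefficient is a positive integer, so each scaling factor is at least 1; this is the property used in the proof of the polynomial kernel analogue of Lemma 4.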

We have the following polynomial kernel analogue of Lemma 4: 



Lemma 5. Let $X \subset \mathbb{R}^k$ be a set of labelled examples such that there is some
degree-$d$ polynomial threshold function $p(x)$ which correctly classifies $X$ and has
PTF margin $\rho$ over $X$. Then the Perceptron algorithm (run on $\phi'_d(X)$ using the
polynomial kernel $K'_d$) makes at most $\frac{4(1+\|X\|^2)^d}{\rho^2}$ mistakes on $X$.



Proof. We view $\phi'_d(x)$ as a vector $(a_S x_S)$ of monomials with coefficients. By
inspection of the coefficients of $(1 + x \cdot y)^d$ it is clear that each $a_S \ge 1$. Let $W'$
be the vector in $\mathbb{R}^{\binom{k+d}{d}}$ such that $W' \cdot \phi'_d(x) = p(x)$ as a formal polynomial. For
each monomial $x_S$ in $p(x)$, the $W'_S$ coordinate of $W'$ satisfies $|W'_S| = |W_S|/a_S \le |W_S|$, where
$W$ is defined as in the proof of Lemma 4, so we have $\|W'\| \le \|W\|$.

The vector $W'$ has margin

$$\frac{\min_{x \in X} |p(x)|}{\|W'\| \cdot \|\phi'_d(X)\|} \ge \frac{\min_{x \in X} |p(x)|}{\|W\| \cdot \|\phi'_d(X)\|}$$

over $\phi'_d(X)$. It is easy to verify that $\|\phi'_d(X)\|^2 \le (1 + \|X\|^2)^d$ so $W'$ has
margin at least $\frac{\rho}{(1 + \|X\|^2)^{d/2}}$. The lemma now follows from the
Perceptron Convergence Theorem. □



The output hypothesis of this kernel Perceptron is an (implicit representation
of a) halfspace over $\mathbb{R}^{\binom{k+d}{d}}$ which can be viewed as a polynomial threshold
function of degree $d$ over $\mathbb{R}^k$.



6 The Main Results 

In this section we give our main learning results by bounding the running time 
of algorithm A and proving that it outputs an accurate hypothesis. 

Our first theorem gives a good bound for the case where t is relatively small: 



Theorem 7. Algorithm A learns any $\rho$-margin intersection of $t$ halfspaces over
R" in at most 7 • (^ log time steps. 

Proof. Let $c$ be an intersection of $t$ origin-centered halfspaces over $\mathbb{R}^n$ which has
margin $\rho$ with respect to distribution $\mathcal{D}$ where $\mathrm{Supp}(\mathcal{D}) \subset S^{n-1}$. Let $m$ equal
the number of examples our algorithm draws from $EX(c, \mathcal{D})$; we defer specifying
m until the end of the proof. Let k = ■ log y), and d = 0{tlogtlog ^). 

Let $X$ be the set of $m$ examples in $\mathbb{R}^n$, and let $M(X)$ be the projected set of $m$
examples in $\mathbb{R}^k$. Note that it takes $nkm$ time steps to construct the set $M(X)$.

By Corollary 1, with probability 1 — <5 we have that | < ||M(A)|| < 2 and 
there is an intersection of t origin-centered halfspaces in Rr which has margin at 
least 7 on M{X). By Theorem 4 there is a polynomial threshold function over 
$\mathbb{R}^k$ of degree $d = O(t \log t \log \frac{1}{\rho})$ which has PTF margin $(\rho/k)^{O(t \log t \log 1/\rho)}$ with respect

to M{X). By Lemma 5 the degree-d polynomial kernel Perceptron algorithm 
makes at most mistakes when run on M{X), and thus once M{X) is 

obtained the algorithm runs for at most m • time steps. 

Now we show that with probability $1 - \delta$ algorithm A outputs an $\epsilon$-accurate
hypothesis for $c$ relative to $\mathcal{D}$. Since the output hypothesis $h(x) = \mathrm{sign}(p(Mx))$
is computed by first projecting $x \in \mathbb{R}^n$ down to $\mathbb{R}^k$ via $M$ and then evaluating
the $k$-variable PTF $p$, it suffices to show that $p$ is a good hypothesis under
the distribution $M(\mathcal{D})$ obtained by projecting $\mathcal{D}$ down to $\mathbb{R}^k$ via $M$. It is
well known (see e.g. [2]) that the VC dimension of the class of degree-$d$ PTFs
over k real variables is Thus by the VC theorem [8] in order to learn 

to accuracy e and confidence S it suffices to take m = 0(^log^ + Mogj). 
It is straightforward to verify that k = (^log ^ ^(^logi)OW 

satisfy the above conditions on m and k. Since d = 0(tlogtlog we have k = 
(| log and w = i • (| log T)‘^(**°stiogi/p) -v^hich proves the theorem. □ 

Note that for a constant $t = O(1)$ number of halfspaces Algorithm A has a
quasipolynomial ($(1/\rho)^{O(\log 1/\rho)}$) runtime dependence on the margin $\rho$, in contrast

with the exponential ((l)'^(*°s p)/p dependence of [3]. 

The proof of Theorem 7 used the polynomial threshold function construction 
of Theorem 4. We can instead use the construction of Theorem 6 to obtain: 

Theorem 8. Algorithm A learns any $\rho$-margin intersection of $t$ halfspaces over
R” in at most ^ ■ (^^ log X'j^(yAJpiogt) steps. 

For a constant p = 0(1) margin Algorithm A has an almost polynomial 
((^O(iogiogt)^ runtime dependence on t, in contrast with the exponential ft^A)'^ 
dependence of [3]. By Corollary 2 the above bound holds for learning t-term 
DNF with margin p as well. 

Finally, we can use the construction of Theorem 5 to obtain: 

Theorem 9. Algorithm A learns any Boolean function oft halfspaces with mar- 
gin p in at most j ■ (^ log log i/p) steps. 

7 Discussion 

Is Random Projection Necessary? A natural question is whether our quantitative
results could be achieved simply by using kernel Perceptron (or a Support
Vector Machine) without first performing random projection. Given a data set
$X$ in $\mathbb{R}^n$ classified by an intersection of $t = 2$ halfspaces with margin $\rho$,
Theorem 4 implies the existence of a polynomial threshold function for $X$ of degree
$d = O(\log(1/\rho))$ with PTF margin $(\rho/n)^{O(\log(1/\rho))}$. Using either the degree-$d$
polynomial kernel or the Complete Symmetric Kernel, we obtain a halfspace
over $\mathbb{R}^{\binom{n+d}{d}}$ which classifies the expanded data set $\phi(X)$ with geometric
margin $(\rho/n)^{O(\log(1/\rho))}$. Thus it appears that without the initial projection step,
the required sample complexity for either kernel Perceptron or an SVM will be
$(n/\rho)^{\Omega(\log(1/\rho))}$, as opposed to the bounds in Section 6 which do not depend on
$n$; so random projection does indeed seem to provide a gain in efficiency.

Lower Bounds on Polynomial Threshold Functions. The main result of
O'Donnell and Servedio in [19], if suitably interpreted, proves that there exists a
set $X \subset \mathbb{R}^n$ labelled according to the intersection of two halfspaces with margin
$\rho$ for which any PTF correctly classifying $X$ must have degree $\Omega\left(\frac{\log(1/\rho)}{\log\log(1/\rho)}\right)$.
This lower bound implies that our choice of $d$ in the proof of Theorem 7 is
essentially optimal with respect to $\rho$. For a discussion of other lower bounds on
PTF constructions see Klivans et al. [15].

Alternative Algorithms. We note that after random projection, in Step 3 of
Algorithm A there are several other algorithms that could be used instead of
kernel Perceptron. For example, we could run a support vector machine over $\mathbb{R}^k$
with the same degree-$d$ polynomial kernel to find the maximum margin hyperplane
in $\mathbb{R}^{\binom{k+d}{d}}$; alternatively we could even explicitly expand each projected
example $M(x) \in \mathbb{R}^k$ into $\phi_d(M(x)) \in \mathbb{R}^{\binom{k+d}{d}}$ and explicitly run Perceptron
(or indeed any algorithm for solving linear programs such as the Ellipsoid
algorithm) to learn a single halfspace in $\mathbb{R}^{\binom{k+d}{d}}$. It can be verified that each of
these approaches gives the same asymptotic runtime and sample complexity as
our kernel Perceptron approach. We use kernel Perceptron both for its simplicity
and for its ability to take advantage of the actual margin if it is better than the
worst-case bounds presented here.

Future Work and Implications for Practice. We feel that our results give 
some theoretical justification for the effectiveness of the polynomial kernel in 
practice, as kernel Perceptron takes direct advantage of the representational 
power of polynomial threshold functions. We are working on experimentally 
assessing the algorithm’s performance. 

Acknowledgements. We thank Santosh Vempala for helpful discussions. 
Abstract. The decomposition method is currently one of the major
methods for solving the convex quadratic optimization problems
associated with support vector machines. Although there exist some
sions of the method that are known to converge to an optimal solution, 
the general convergence properties of the method are not yet fully under- 
stood. In this paper, we present a variant of the decomposition method 
that basically converges for any convex quadratic optimization problem 
provided that the policy for working set selection satisfies three abstract 
conditions. We furthermore design a concrete policy that meets these 
requirements. 



1 Introduction 

Support vector machines (SVMs) introduced by Vapnik and co-workers [4,25] 
are a promising technique for classification, function approximation, and other 
key problems in statistical learning theory. In this paper, we mainly discuss the 
optimization problems that are induced by SVMs, which are special cases of 
convex quadratic optimization.^ 

Example 1. Two popular variants of SVMs lead to the optimization problems 
given by (1) and (2), respectively: 

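$$\min_{x} \; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} Q_{ij} x_i x_j - \sum_{i=1}^{m} x_i \quad \text{s.t.} \quad \sum_{i=1}^{m} y_i x_i = 0, \;\; \forall i = 1,\dots,m : 0 \le x_i \le C \qquad (1)$$

$$\min_{x} \; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} Q_{ij} x_i x_j \quad \text{s.t.} \quad \sum_{i=1}^{m} y_i x_i = 0, \;\; \sum_{i=1}^{m} x_i \ge \nu, \;\; \forall i = 1,\dots,m : 0 \le x_i \le \frac{1}{m} \qquad (2)$$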

* This work has been supported by the Deutsche Forschungsgemeinschaft Grant SI 
498/7-1. 

^ The reader interested in more background information about SVMs is referred to [25, 
6,23]. 
Here, $Q \in \mathbb{R}^{m \times m}$ is a positive (semi-)definite matrix, $y \in \{-1,1\}^m$, and $x$
is a vector of $m$ real variables. $C$ and $\nu$ are real constants. The first problem
is related to one of the classical SVM models; the second one is related to the
so-called $\nu$-SVM introduced by Schölkopf, Smola, Williamson, and Bartlett [24].

The difficulty of solving problems of this kind is the density of $Q$, whose
entries are typically non-zero. Thus, a prohibitive amount of memory is required
to store the matrix, and traditional optimization algorithms (such as Newton's
method) cannot be directly applied. Several authors have proposed (different
variants of) a decomposition method to overcome this difficulty [20,11,21,22,5,13,
17,14,12,18,19,15,9,16,10]. This method keeps track of a current feasible solution
which is iteratively improved. In each iteration the variable indices are split into a
“working set” $I \subseteq \{1,\dots,m\}$ and its complement $J = \{1,\dots,m\} \setminus I$. Then, the
subproblem with variables $x_i$, $i \in I$, is solved, thereby leaving the values for the
remaining variables $x_j$, $j \in J$, unchanged. The success of the method depends
quite sensitively on the policy for the selection of the working set
$I$ (whose size is typically bounded by a small constant). Ideally, the selection
procedure should be computationally efficient and, at the same time, effective in
the sense that the resulting sequence of feasible solutions converges (with high
speed) to an optimal limit point. Clearly, these goals are conflicting in general
and trade-offs are to be expected. At present, it seems fair to say that
the issue of convergence is not fully understood (although some of the papers
mentioned above certainly shed some light on this question).
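The iteration just described can be sketched in its simplest possible setting. The code below is a deliberately reduced illustration, not one of the cited variants: it minimizes $\frac{1}{2}x^\top Q x + w^\top x$ under box constraints only (the equality constraint is dropped, which is an assumption made to keep the subproblem one-dimensional), uses a working set of a single index, and selects that index by a hypothetical largest-violation rule. It also assumes every diagonal entry of $Q$ is positive.

```python
def decomposition_box_qp(Q, w, lower, upper, tol=1e-9, max_iter=10000):
    """Minimize 0.5 x^T Q x + w^T x subject to lower <= x <= upper.

    Working sets have size one: the selected variable is optimized exactly
    while all other variables keep their current values.
    """
    m = len(w)
    x = list(lower)  # any point of the box is a feasible start

    for _ in range(max_iter):
        grad = [sum(Q[i][j] * x[j] for j in range(m)) + w[i] for i in range(m)]

        def violation(i: int) -> float:
            # A gradient component pointing out of the box cannot be improved on.
            if x[i] <= lower[i] and grad[i] > 0:
                return 0.0
            if x[i] >= upper[i] and grad[i] < 0:
                return 0.0
            return abs(grad[i])

        i = max(range(m), key=violation)  # working set selection
        if violation(i) <= tol:
            break                         # optimality conditions hold
        # Exact solution of the one-variable subproblem, clipped to the box.
        x[i] = min(upper[i], max(lower[i], x[i] - grad[i] / Q[i][i]))
    return x


Q = [[2.0, 1.0], [1.0, 2.0]]
w = [-3.0, -3.0]
interior = decomposition_box_qp(Q, w, [0.0, 0.0], [2.0, 2.0])
at_bound = decomposition_box_qp(Q, w, [0.0, 0.0], [0.5, 2.0])
```

In the first call the unconstrained minimizer $(1,1)$ lies inside the box and the iterates approach it geometrically; in the second call the bound $x_1 \le 0.5$ is active and the method stops at $(0.5, 1.25)$. Even in this toy setting the trade-off mentioned above is visible: the selection rule recomputes the full gradient in every iteration, which is exactly the kind of cost a practical policy tries to avoid.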

We briefly note that the random sampling technique applied in [2,1]
(which is based on the Simple Sampling Lemma by Gärtner and Welzl [7])
can also be viewed as a kind of decomposition method. Here, the working sets (=
samples) are probabilistically selected according to a dynamic weighting scheme. 
The general idea is to update the weights in such a fashion that the support vec- 
tors not yet included in the sample become more and more likely to be chosen. 
At some point the sample will contain enough support vectors such that the 
solution obtained in the next iteration will be globally optimal. The efficiency 
of this technique seems to depend strongly on a parameter k that can be rigor- 
ously defined in mathematical terms but is unknown in practice. Parameter k is 
certainly bounded by m but might be much smaller under lucky circumstances. 
The sample size grows quadratically in k and in the dimension n of the feature 
space. If k and n are much smaller than m, the random sampling technique 
seems to produce nice results. We briefly point to the main differences between 
the random sampling technique and other work on the decomposition method 
(including ours): 

- random selection of the working set 

- dependence of the performance on an unknown parameter k 

- comparably large working sets (samples) 

- very few iterations on the average to optimum if k is small 

We close the introduction by explaining the main difference between this
paper and earlier work on the decomposition method. It seems that all existing
papers concerned with the decomposition method perform a kind of non-uniform
analysis in the sense that the results very much depend on the concrete instance
of convex quadratic optimization that is induced by the specific SVM under
consideration. Given the practical importance of SVM problems, this is certainly
justified and may occasionally lead to methods with nice properties (concerning
efficiency of working set selection and speed of convergence). In the long run,
however, it bears the danger that any new variant of an SVM must be analyzed
from scratch because the generality (if any) of the arguments used so far
is left too much in the dark. In this paper, we aim to establish convergence
in a quite general setting. We present a variant of the decomposition
method that converges for basically any convex quadratic optimization problem
provided that the policy for working set selection satisfies three abstract
conditions. We furthermore design a concrete policy that meets these requirements.
We admittedly ignore computational issues. The analysis of the trade-off between
computational efficiency, speed of convergence, and degree of generality is
left as a subject of future research.

2 Definitions, Notations, and Basic Facts 

For a matrix A G G denotes the Fth column. A^ G denotes 

the transpose of $A$. Vectors are considered as column vectors such that the
transpose of a vector is a row vector. The “all-zeroes” vector is denoted as 0,
where its dimension will always become clear from the context. For two vectors
$w, x \in \mathbb{R}^m$, $w^\top x = \sum_{i=1}^{m} w_i x_i$ denotes the standard scalar product. $\|x\| :=
\left(\sum_{i=1}^{m} x_i^2\right)^{1/2}$ denotes the Euclidean norm of $x$. We often consider complementary
sets $I \subseteq \{1,\dots,m\}$, $J = \{1,\dots,m\} \setminus I$ of indices. The notation $A_I$ refers to the
submatrix of $A$ consisting of all columns $A_i$ such that $i \in I$. The equation $A =
[A_I, A_J]$ means that $A$ decomposes into submatrices $A_I, A_J$ (although, strictly
speaking, the equation holds only after the columns of $[A_I, A_J]$ are permuted
such that they are ordered as in $A$). A similar convention is applied to vectors
such that equations like $Ax = b$ can be expanded to

$$A_I x_I + A_J x_J = b .$$

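Similarly, a matrix $Q \in \mathbb{R}^{m \times m}$ decomposes into four blocks $Q_{I,I}, Q_{I,J}, Q_{J,I}, Q_{J,J}$
such that an expression like $x^\top Q x$ can be expanded to

$$x^\top Q x = x_I^\top Q_{I,I} x_I + x_I^\top Q_{I,J} x_J + x_J^\top Q_{J,I} x_I + x_J^\top Q_{J,J} x_J .$$

If $Q$ is symmetric (in particular, if $Q$ is positive (semi-)definite), then
$x_I^\top Q_{I,J} x_J = x_J^\top Q_{J,I} x_I$.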
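The block expansion can be checked directly on a small instance. In the sketch below the matrix, the vector and the index sets are arbitrary choices for illustration, and `quad_block` is a helper name introduced here.

```python
def quad_block(Q, x, rows, cols):
    """Return x_rows^T Q_{rows, cols} x_cols for index lists rows and cols."""
    return sum(x[i] * Q[i][j] * x[j] for i in rows for j in cols)


Q = [[2, 1, 0],
     [1, 3, 1],
     [0, 1, 4]]          # symmetric
x = [1, -2, 3]
I, J = [0, 2], [1]       # complementary index sets (0-based)

everything = list(range(len(x)))
full = quad_block(Q, x, everything, everything)      # x^T Q x
blocks = (quad_block(Q, x, I, I) + quad_block(Q, x, I, J)
          + quad_block(Q, x, J, I) + quad_block(Q, x, J, J))
```

Here the four blocks contribute $38$, $-8$, $-8$ and $12$, which sum to $x^\top Q x = 34$; the two mixed terms agree because $Q$ is symmetric.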

Let $\mathcal{P}$ denote an optimization problem that is given by a cost function $f(x)$
and a collection of constraints, where $x$ denotes a collection of real-valued
variables. As usual, a feasible solution for $\mathcal{P}$ is an assignment of values to the variables
that satisfies all constraints. The feasibility region (consisting of all feasible
solutions for $\mathcal{P}$) is denoted as $R(\mathcal{P})$. The smallest possible cost of a feasible solution
is then given by

$$\mathrm{opt}(\mathcal{P}) = \min_{x \in R(\mathcal{P})} f(x) .$$

Writing “min” instead of “inf” is justified because we will deal only with
problems $\mathcal{P}$ whose feasibility region is compact. In the remainder of the paper, we
assume some familiarity with mathematical programming and matrix theory.

2.1 Convex Quadratic Programming Subject to Box Constraints 

Throughout this paper,

$$f(x) = \frac{1}{2} x^\top Q x + w^\top x = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} Q_{ij} x_i x_j + \sum_{i=1}^{m} w_i x_i$$

denotes a convex cost function, where $Q \in \mathbb{R}^{m \times m}$ is a positive semi-definite
matrix over the reals with the additional (somewhat technical) property that,
for each $I \subseteq \{1,\dots,m\}$ of size at most $q$, the submatrix $Q_{I,I}$ of $Q$ is positive
definite. Here, $q$ denotes a (typically small) constant (which will later bound
from above the size of the working set). Note that the technical condition for $Q$
is satisfied if $Q$ itself is positive definite. As the structure of the cost function has
become clear by now, we move on and define our basic optimization problem $\mathcal{P}$:

min/(a;) s.t. Ax = b,l < x < r (4) 

X 

Here, A G b G l,r G K™, and I < x < r is the short-notation for the 

“box constraints” 



\/i = 1, . . . ,m : k < Xi < Xi . 

A few comments are in place:

- Any bounded* optimization problem with cost function $f(x)$ and linear
  equality- and inequality-constraints can be brought into the form (4) because
  we may convert the linear inequalities into linear equations by introducing
  non-negative slack variables. By the compactness of the feasibility region,
  we may also put a suitable upper bound on each slack variable such that the
  remaining linear inequalities take the form of box constraints.

- The technical assumption that we have put on matrix $Q$ is slightly more
  restrictive than just assuming it is positive semi-definite. As far as the
  decomposition method and SVM applications are concerned, this assumption
  is often satisfied.†

* Here, “bounded” means that the feasibility region is compact (or can be made
  compact without changing the smallest possible cost).

† For some kernels like, for example, the RBF-kernel, it is certainly true; for other
  kernels it is typically satisfied provided that $q$ is sufficiently small. See also the
  discussion of this point in [17].







In order to illustrate the first comment, we convert problem (2) into a problem
with box constraints by introducing the slack variable

1 m ^ 

min Qx s.t. 'ij x — 0, Xi — i f <1 , Vi = l,... ,m : 0 < Xi < — 

a:,e 2 m 

i=l 

(5) 

The optimal solutions for $\mathcal{P}$ can be characterized in terms of the gradient
$\nabla f(x) = Qx + w$ as follows:

Lemma 1. Let $\mathcal{P}$ denote the optimization problem that is induced by
$A \in \mathbb{R}^{k \times m}$, $b \in \mathbb{R}^k$ and $l, r \in \mathbb{R}^m$ as described in (4) and let $U$
denote the linear subspace of $\mathbb{R}^m$ that is spanned by the rows of matrix $A$. Then,
$x$ is optimal for $\mathcal{P}$ iff there exists $u \in U$ such that

$$x_i \ne l_i \Rightarrow \nabla f(x)_i - u_i \le 0 \quad \text{and} \quad x_i \ne r_i \Rightarrow u_i - \nabla f(x)_i \le 0 \qquad (6)$$

holds for $i = 1,\ldots,m$.

Proof. It is well-known that $x$ is optimal for $\mathcal{P}$ iff it satisfies the
Karush-Kuhn-Tucker conditions. The latter are easily seen to be equivalent to the existence
of $\beta \in \mathbb{R}^k$ such that the following holds for $i = 1,\ldots,m$:

$$x_i \ne r_i \Rightarrow \nabla f(x)_i - A_i^\top \beta \ge 0 \quad \text{and} \quad x_i \ne l_i \Rightarrow A_i^\top \beta - \nabla f(x)_i \ge 0 .$$

(Recall the convention that $A_i \in \mathbb{R}^k$ denotes the $i$'th column of $A$.) The lemma
now follows from the observation that $A^\top \beta$ ranges over $U$ when $\beta$ ranges over
$\mathbb{R}^k$. □

With each $x \in \mathbb{R}^m$, we associate the function

$$C(x) := \inf_{u \in U} \sum_{i=1}^m (x_i - l_i) \max\{0, \nabla f(x)_i - u_i\} + (r_i - x_i) \max\{0, u_i - \nabla f(x)_i\} , \qquad (7)$$

whose properties are summarized in

Lemma 2. $C(x)$ is a continuous function on $R(\mathcal{P})$. Moreover, for $x \in R(\mathcal{P})$,
$C(x) \ge 0$ with equality iff $x$ is optimal for $\mathcal{P}$.

Proof. We first show that $C(x)$ is continuous. Obviously, the function

$$C(x,u) := \sum_{i=1}^m (x_i - l_i) \max\{0, \nabla f(x)_i - u_i\} + (r_i - x_i) \max\{0, u_i - \nabla f(x)_i\}$$

is continuous in $x$ and $u$. Moreover, $C(x) = \inf_{u \in U} C(x,u)$. With each constant
$B > 0$, we associate the compact region $U(B) := U \cap \{u \in \mathbb{R}^m \mid \|u\| \le B\}$.
It is not hard to see that there exists a constant $B > 0$ such that
$C(x) = \inf_{u \in U(B)} C(x,u)$ holds for each $x \in R(\mathcal{P})$. By compactness, $C(x,u)$ is
uniformly continuous on $R(\mathcal{P}) \times U(B)$. Thus, for each $\epsilon > 0$, there exists
$\delta > 0$ such that, for all $x, x' \in R(\mathcal{P})$,

$$\forall u \in U(B) : \ \|x' - x\| < \delta \Rightarrow |C(x',u) - C(x,u)| < \epsilon .$$

Since the latter statement implies that $|C(x') - C(x)| < \epsilon$, we may conclude that
$C(x)$ is continuous.

If $x$ is a feasible solution for $\mathcal{P}$, then $l \le x \le r$, which clearly implies that
$C(x) \ge 0$. Furthermore, $C(x) = 0$ iff there exists $u \in U$ such that (6) is satisfied.
According to Lemma 1, this is true iff $x$ optimally solves $\mathcal{P}$. □
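
To make the role of the function in (7) concrete, here is a minimal sketch for the special case in which there are no equality constraints, so that the subspace spanned by the rows of A contains only the zero vector and the infimum over u disappears. The restriction to this case, the function names, and the example data are assumptions made for illustration only.

```python
def gradient(Q, w, x):
    """Gradient Qx + w of f(x) = 0.5 x^T Q x + w^T x."""
    m = len(x)
    return [sum(Q[i][j] * x[j] for j in range(m)) + w[i] for i in range(m)]


def witness(Q, w, l, r, x, u=None):
    """Evaluate C(x, u) from the proof of Lemma 2.

    With u omitted, u = 0 is used, which is the only admissible choice
    when there are no equality constraints (assumed special case).
    """
    g = gradient(Q, w, x)
    if u is None:
        u = [0.0] * len(x)
    return sum((x[i] - l[i]) * max(0.0, g[i] - u[i])
               + (r[i] - x[i]) * max(0.0, u[i] - g[i])
               for i in range(len(x)))


# Hypothetical example: f(x) = x1^2 + x2^2 - 2 x1 + 4 x2 on the box [0, 1]^2.
Q = [[2.0, 0.0], [0.0, 2.0]]
w = [-2.0, 4.0]
l, r = [0.0, 0.0], [1.0, 1.0]
at_optimum = witness(Q, w, l, r, [1.0, 0.0])   # the minimizer on the box
at_center = witness(Q, w, l, r, [0.5, 0.5])    # a suboptimal feasible point
```

In this example the value is zero at the box-constrained minimizer (1, 0) and strictly positive at the center of the box, in line with Lemma 2.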

The method of feasible directions by Zoutendijk [26] allows for another
characterization of the optimal solutions for $\mathcal{P}$. To this end, we associate the
following optimization problem $\mathcal{D}[x]$ with each $x \in R(\mathcal{P})$:

$$\min_d \nabla f(x)^\top d \quad \text{s.t.} \quad Ad = 0,\ \forall i = 1,\ldots,m : \ -1 \le d_i \le 1 \ \wedge \ \begin{cases} x_i = l_i \Rightarrow d_i \ge 0 \\ x_i = r_i \Rightarrow d_i \le 0 \end{cases} \qquad (8)$$

Intuitively, $\nabla f(x)^\top d < 0$ indicates that we can reduce the cost of the current
solution $x$ for $\mathcal{P}$ by moving it in direction $d$. More formally, the following holds:

Lemma 3 ([26,3]). $\mathrm{opt}(\mathcal{D}[x]) \le 0$ with equality iff $x$ is optimal for $\mathcal{P}$.
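
For intuition, problem (8) has a closed-form solution in the special case without equality constraints, because the objective then separates over the coordinates. The sketch below covers only this assumed special case; the function name and the sample gradients are hypothetical.

```python
def feasible_direction(g, l, r, x):
    """Solve (8) when there is no constraint Ad = 0 (assumed special case).

    Each coordinate moves against the sign of its gradient entry, unless the
    corresponding box constraint is active and blocks that move.
    Bounds are compared with exact equality, which is fine for this sketch.
    """
    d = []
    for gi, li, ri, xi in zip(g, l, r, x):
        lowest = 0.0 if xi == li else -1.0   # x_i = l_i forces d_i >= 0
        highest = 0.0 if xi == ri else 1.0   # x_i = r_i forces d_i <= 0
        if gi > 0:
            d.append(lowest)
        elif gi < 0:
            d.append(highest)
        else:
            d.append(0.0)
    cost = sum(gi * di for gi, di in zip(g, d))
    return d, cost


# Gradients of the hypothetical cost x1^2 + x2^2 - 2 x1 + 4 x2 on [0, 1]^2.
d_opt, cost_opt = feasible_direction([0.0, 4.0], [0.0, 0.0], [1.0, 1.0], [1.0, 0.0])
d_mid, cost_mid = feasible_direction([-1.0, 5.0], [0.0, 0.0], [1.0, 1.0], [0.5, 0.5])
```

At the optimal point (1, 0) the best direction has cost zero, while at the suboptimal point (0.5, 0.5) the cost is negative, as Lemma 3 predicts.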



2.2 Subproblems Induced by Working Sets

With a set $I \subseteq \{1,\ldots,m\}$, we will always associate its complement $J = \bar{I}$.
Furthermore, we use the short-notation $R_I = \{x_I \mid x \in R\}$ for each $R \subseteq \mathbb{R}^m$.

For each $I \subseteq \{1,\ldots,m\}$ and each $x_J \in R(\mathcal{P})_J$, we denote by
$\mathcal{P}_{I,x_J}$ the problem that results from $\mathcal{P}$ by leaving $x_J$ unchanged and
choosing $x_I$ such as to minimize $f(x)$ subject to the constraints in $\mathcal{P}$. More
formally, for cost function

$$f_{I,x_J}(x_I) = \frac{1}{2} x_I^\top Q_{I,I} x_I + (Q_{I,J} x_J + w_I)^\top x_I$$

(with gradient $\nabla f_{I,x_J}(x_I) = Q_{I,I} x_I + Q_{I,J} x_J + w_I$), problem
$\mathcal{P}_{I,x_J}$ reads as follows:

$$\min_{x_I} f_{I,x_J}(x_I) \quad \text{s.t.} \quad A_I x_I = b - A_J x_J,\ l_I \le x_I \le r_I$$

Note that this problem is of type (4) with $x_I, Q_{I,I}, Q_{I,J} x_J + w_I, A_I, b - A_J x_J, l_I, r_I$
substituted for $x, Q, w, A, b, l, r$, respectively. Note furthermore that $x_I$ is a
feasible solution for $\mathcal{P}_{I,x_J}$ iff $x \in R(\mathcal{P})$, i.e., iff $x_I$ extends $x_J$
to a feasible solution $x$ for $\mathcal{P}$. Recall that, according to our notational
conventions, its feasibility region is written as $R(\mathcal{P}_{I,x_J})$.

2.3 The Decomposition Method



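Let $q$ be a constant that bounds from above the size of the working set. Let
$(C_I(x))$, with $I$ ranging over the subsets of $\{1,\ldots,m\}$ of size at most $q$, be a
family of functions from $R(\mathcal{P})$ to $\mathbb{R}_0^+$. With each such
family, we associate the following method for solving $\mathcal{P}$:

(1) Let $x^0$ be a feasible solution for $\mathcal{P}$ (arbitrarily chosen) and $s := 0$.

(2) Construct a working set $I^s \subseteq \{1,\ldots,m\}$ that maximizes $c(I) := C_I(x^s)$
    subject to $|I| \le q$. If $c(I^s) = 0$, then return $x^s$ and stop; otherwise set
    $J^s := \{1,\ldots,m\} \setminus I^s$, find an optimal solution $x'_{I^s}$ for
    $\mathcal{P}_{I^s, x^s_{J^s}}$, set

    $$x^{s+1}_{I^s} := x'_{I^s} , \quad x^{s+1}_{J^s} := x^s_{J^s} , \quad s := s + 1$$

    and goto (2).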



We refer to this algorithm as the decomposition method induced by $(C_I(x))$.
We will show in section 3 that it converges to an optimal solution for $\mathcal{P}$ if the
following conditions hold:

(C1) For each $I \subseteq \{1,\ldots,m\}$ such that $|I| \le q$, $C_I(x)$ is continuous on
     $R(\mathcal{P})$.

(C2) If $|I| \le q$ and $x_I$ is an optimal solution for $\mathcal{P}_{I,x_J}$, then $C_I(x) = 0$.

(C3) If $x$ is not an optimal solution for $\mathcal{P}$, then there exists an
     $I \subseteq \{1,\ldots,m\}$ such that $|I| \le q$ and $C_I(x) > 0$.

If these conditions are satisfied, we call the family $(C_I(x))$ a $q$-sparse witness of
suboptimality. In section 4, we will present such a family of functions provided
that $q \ge k + 1$.

A few comments are in place here. $x^{s+1}$ is always a feasible solution for $\mathcal{P}$.
Moreover, $x^{s+1}_{I^s}$ is (by construction) an optimal solution for
$\mathcal{P}_{I^s, x^s_{J^s}}$. Thus,

$$C_{I^s}(x^{s+1}) = 0 \qquad (9)$$

according to (C2). If $x^s$ is (accidentally) an optimal solution for $\mathcal{P}$, then it is
(a fortiori) an optimal solution for each subproblem $\mathcal{P}_{I,x^s_J}$ and, again
according to (C2), the decomposition method will reach the stop-condition and return $x^s$.
If $x^s$ is not optimal for $\mathcal{P}$, then (C3) makes sure that there exists a working
set $I$ of size at most $q$ such that $C_I(x^s) > 0$. Thus, the working set $I^s$ actually
constructed by the decomposition method satisfies

$$C_{I^s}(x^s) > 0 , \qquad (10)$$

and the method cannot become stuck at a suboptimal solution.
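
The steps above can be sketched in a few lines for the simplest possible setting. The sketch assumes that there are no equality constraints (so k = 0), that the working set has size q = 1, and that the per-coordinate witness is the single-index analogue of (7); none of these choices is prescribed by the text. The exact stop condition c(I) = 0 is replaced by a small tolerance, which is a concession to floating-point arithmetic, and all names are hypothetical.

```python
def decomposition_method(Q, w, l, r, x0, tol=1e-10, max_iter=10_000):
    """Box-constrained decomposition with working sets of size 1 (assumed setting).

    Requires Q[i][i] > 0 for all i, which is the technical condition on Q for q = 1.
    """
    m = len(x0)
    x = list(x0)
    for _ in range(max_iter):
        g = [sum(Q[i][j] * x[j] for j in range(m)) + w[i] for i in range(m)]
        # Witness value of each singleton working set {i}.
        c = [(x[i] - l[i]) * max(0.0, g[i]) + (r[i] - x[i]) * max(0.0, -g[i])
             for i in range(m)]
        i = max(range(m), key=lambda t: c[t])
        if c[i] <= tol:          # stop condition, up to the tolerance
            break
        # Subproblem in x_i alone: minimize 0.5 Q_ii x_i^2 + b x_i on [l_i, r_i].
        b = g[i] - Q[i][i] * x[i]
        x[i] = min(r[i], max(l[i], -b / Q[i][i]))
    return x


# Hypothetical example on the box [0, 1]^2.
Q = [[2.0, 1.0], [1.0, 2.0]]
w = [-3.0, 1.0]
solution = decomposition_method(Q, w, [0.0, 0.0], [1.0, 1.0], [0.5, 0.5])
```

On this example the method first updates the second coordinate, then the first, and stops at (1, 0), where both singleton witnesses vanish.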

We assume in the sequel that the sequence $x^s$ evolves as described above.
Note that $f(x^s)$ is decreasing with $s$ (simply because $x^s_{I^s}$ is a feasible solution
for $\mathcal{P}_{I^s, x^s_{J^s}}$). Thus, $f(x^s)$ will converge to a limit even if $x^s$ does
not converge to a limit point. However, since the feasibility region for $\mathcal{P}$ is
compact, there must exist a subsequence $(x^s)_{s \in S}$ that converges to a (feasible!)
limit point, say $x^\infty$. Clearly, $f(x^\infty) = \lim_{s \to \infty} f(x^s)$. It remains to
show that $x^\infty$ is an optimal solution.

3 Analysis of the Decomposition Method

This section is devoted to the proof of convergence. The proof will proceed by
assuming, for sake of contradiction, that $x^\infty$ is not an optimal solution for
$\mathcal{P}$. From condition (C3) and from a continuity argument, we will be able to conclude
that $x^s_{I^s}$ is not even optimal for subproblem $\mathcal{P}_{I^s, x^s_{J^s}}$ if $s \in S$ is
chosen sufficiently large. Since $x^{s+1}_{I^s}$ is an optimal solution for this subproblem
(by the definition of the decomposition method), we would now be close to a contradiction
if the continuity argument also applied to $s + 1$. Here, however, we run into a difficulty
since $s + 1$ does not necessarily belong to $S$. Thus, although sequence $(x^s)_{s \in S}$
approaches $x^\infty$ when $s$ approaches infinity, sequence $(x^{s+1})_{s \in S}$ might perhaps
behave differently. It turns out, however, that this is not the case. The main
argument against this hypothetical possibility will be that the cost reduction
per iteration of the decomposition method is proportional to the square of the
distance between $x^s$ and $x^{s+1}$. The following subsections flesh out this general
idea.



3.1 Cost Reduction per Iteration

How big is the cost reduction when we pass from $x^s$ to $x^{s+1}$? Here is an answer
to this question:

Lemma 4. Let $f(x)$ be the cost function of $\mathcal{P}$ as given by (3). Let

$$\alpha := \min_{I : |I| \le q} \mathrm{eig}(Q_{I,I}) > 0 ,$$

where $\mathrm{eig}(\cdot)$ denotes the smallest eigenvalue of a matrix.* With these notations,
the following holds:

$$f(x^{s+1}) - f(x^s) \le -\frac{\alpha}{2} \|x^{s+1} - x^s\|^2 .$$

Proof. Since $f$ is a quadratic function of the form (3), Taylor-expansion around
$x^{s+1}$ yields

$$f(x^s) = f(x^{s+1}) + \nabla f(x^{s+1})^\top (x^s - x^{s+1}) + \frac{1}{2} (x^s - x^{s+1})^\top Q (x^s - x^{s+1}) . \qquad (11)$$

Recall that $x^{s+1}$ minimizes $f(x)$ subject to the constraints $Ax = b$, $l \le x \le r$,
and $x_{J^s} = x^s_{J^s}$. Since these constraints define a convex region containing $x^s$
and $x^{s+1}$, we may conclude that $f(x^{s+1}) = \min_{x \in L} f(x)$ where $L$ denotes the
line segment between $x^s$ and $x^{s+1}$. Thus, the gradient at $x^{s+1}$ in direction to $x^s$
is ascending, i.e.,

$$\nabla f(x^{s+1})^\top (x^s - x^{s+1}) \ge 0 . \qquad (12)$$

Note furthermore that

$$(x^s - x^{s+1})^\top Q (x^s - x^{s+1}) \ge \alpha \|x^s - x^{s+1}\|^2 \qquad (13)$$

is an immediate consequence of the Courant-Fischer Minimax Theorem [8] (recall that
$x^s$ and $x^{s+1}$ differ only in the components indexed by $I^s$).
From (11), (12), and (13), the lemma follows. □

* Note that the technical property that we have put on $Q$ in section 2.1 makes sure
  that $\alpha$ is strictly positive.
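
The bound of Lemma 4 can be checked on a single update step. The sketch below uses a hypothetical two-variable problem with working sets of size 1, so that the constant in the lemma is simply the smallest diagonal entry of Q; the step from (0.5, 0.5) to (0.5, 0) minimizes the cost over the second coordinate within the box [0, 1]^2.

```python
def cost(Q, w, x):
    """f(x) = 0.5 x^T Q x + w^T x."""
    m = len(x)
    quad = sum(x[i] * Q[i][j] * x[j] for i in range(m) for j in range(m))
    return 0.5 * quad + sum(w[i] * x[i] for i in range(m))


def lemma4_slack(Q, w, x_old, x_new, alpha):
    """Return (bound) - (actual change in cost); non-negative when Lemma 4 holds."""
    change = cost(Q, w, x_new) - cost(Q, w, x_old)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_new, x_old))
    return -0.5 * alpha * squared_distance - change


# Hypothetical data: one coordinate update on the box [0, 1]^2.
Q = [[2.0, 1.0], [1.0, 2.0]]
w = [-3.0, 1.0]
alpha = min(Q[i][i] for i in range(2))      # smallest eigenvalue over 1x1 blocks
slack = lemma4_slack(Q, w, [0.5, 0.5], [0.5, 0.0], alpha)
```

Here the cost drops by 1 while the lemma guarantees a drop of at least 0.25, so the slack is 0.75.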

We briefly note that Lin [17] has shown a similar lemma for the special 
optimization problem (1). Although our lemma is more general, the proof found 
in [17] is much more complicated. 



3.2 Facts Being Valid Asymptotically

Lemma 5. For each $\delta > 0$, there exists $s_0 \ge 1$ such that

$$\|x^s - x^\infty\| < \frac{\delta}{2} , \quad \|x^{s+1} - x^s\| < \frac{\delta}{2} , \quad \|x^{s+1} - x^\infty\| < \delta$$

holds for all $s \in S$ provided that $s \ge s_0$.*

Proof. Recall that $(f(x^s))_{s \ge 1}$ is a monotonically decreasing sequence that
approaches $f(x^\infty)$ when $s$ tends to infinity. Thus, there exists $s_0'$ such that

$$0 \le f(x^s) - f(x^{s+1}) \le f(x^s) - f(x^\infty) < \frac{\alpha \delta^2}{8}$$

holds for all $s \ge s_0'$. According to Lemma 4,

$$\|x^{s+1} - x^s\| \le \sqrt{\frac{2}{\alpha} \left( f(x^s) - f(x^{s+1}) \right)} .$$

Thus

$$\|x^{s+1} - x^s\| < \frac{\delta}{2}$$

holds for all $s \ge s_0'$. Since $(x^s)_{s \in S}$ converges to $x^\infty$, there exists $s_0''$
such that

$$\|x^s - x^\infty\| < \frac{\delta}{2}$$

holds for all $s \in S$, $s \ge s_0''$. Setting $s_0 = \max\{s_0', s_0''\}$, we obtain

$$\|x^{s+1} - x^s\| < \frac{\delta}{2} \quad \text{and} \quad \|x^s - x^\infty\| < \frac{\delta}{2}$$

for all $s \in S$, $s \ge s_0$. This implies that $\|x^{s+1} - x^\infty\| < \delta$ and completes the
proof of the lemma. □

* Inequality $\|x^{s+1} - x^\infty\| < \delta$, which is immediate from the preceding two
  inequalities, has been included for ease of later reference.

Corollary 1. If $(C_I(x))$ satisfies condition (C1), then, for each $\epsilon > 0$, there
exists $s_0 \ge 1$ such that

$$|C_I(x^\infty) - C_I(x^s)| < \epsilon \qquad (14)$$

$$|C_I(x^\infty) - C_I(x^{s+1})| < \epsilon \qquad (15)$$

holds for each working set $I \subseteq \{1,\ldots,m\}$ of size at most $q$ and for all $s \in S$
provided that $s \ge s_0$.

Proof. The corollary easily follows from Lemma 5, the fact that there are only
finitely many sets $I \subseteq \{1,\ldots,m\}$ of size at most $q$, and condition (C1) stating
that each individual function $C_I(x)$ is continuous. □

3.3 The Main Theorem

Theorem 1. Assume that $(C_I(x))$ satisfies conditions (C1), (C2), (C3), i.e., it
is a $q$-sparse witness of suboptimality. Let $x^s$ be a sequence of feasible solutions
for $\mathcal{P}$ that is produced by the decomposition method induced by $(C_I(x))$ and let
$(x^s)_{s \in S}$ be a converging subsequence. Then, the limit point $x^\infty$ of
$(x^s)_{s \in S}$ is an optimal solution for $\mathcal{P}$.

Proof. Assume for sake of contradiction that $x^\infty$ is not an optimal solution for
$\mathcal{P}$. According to (C3), there exists a working set $I \subseteq \{1,\ldots,m\}$ such that
$|I| \le q$ and

$$\epsilon_0 := C_I(x^\infty) > 0 .$$

In the sequel, we will apply Corollary 1 three times with $\epsilon = \epsilon_0/3$, respectively.
Assume that $s \in S$ is sufficiently large in the sense of Corollary 1 such that,
according to (14), the following holds:

$$C_I(x^s) > \frac{2 \epsilon_0}{3} .$$

Thus, the working set $I^s$ returned by the decomposition method in iteration
$s + 1$ satisfies

$$C_{I^s}(x^s) > \frac{2 \epsilon_0}{3} .$$

Another application of (14) leads to

$$C_{I^s}(x^\infty) > \frac{\epsilon_0}{3} .$$

From (15), we get

$$C_{I^s}(x^{s+1}) > 0 .$$

Since $x^{s+1}_{I^s}$ is an optimal solution for $\mathcal{P}_{I^s, x^s_{J^s}}$, we may however
infer from (9) that

$$C_{I^s}(x^{s+1}) = 0 .$$

We arrived at a contradiction. □

4 A Sparse Witness of Sub-optimality

In this section, we present a concrete family $(C_I(x))$ of functions that satisfies the
conditions (C1), (C2), (C3) needed for our proof of convergence from section 3.
We will define $C_I(x)$ such that it plays the same role for $\mathcal{P}_{I,x_J}$ that the
function $C(x)$ (defined in (7)) has played for $\mathcal{P}$. More formally, let $U_I$ denote
the subspace spanned by the rows of $A_I$, and define $C_I(x)$ to be equal to

$$\inf_{u \in U_I} \sum_{i \in I} (x_i - l_i) \max\{0, \nabla f_{I,x_J}(x_I)_i - u_i\} + (r_i - x_i) \max\{0, u_i - \nabla f_{I,x_J}(x_I)_i\} .$$

In what follows, we use the notations $C_I(x)$ and $C_{I,x_J}(x_I)$ interchangeably. The
former notation stresses that $C_I(x)$ is viewed as a function of all components
of $x$, whereas the latter notation stresses the relation between this function and
the subproblem $\mathcal{P}_{I,x_J}$ that is explained in Corollary 2 below.

Recall the optimization problem $\mathcal{D}[x]$ from (8). Let $\mathcal{D}_{I,x_J}[x_I]$ be the
optimization problem given by

$$\min_{d_I} \nabla f_{I,x_J}(x_I)^\top d_I \quad \text{s.t.} \quad A_I d_I = 0,\ \forall i \in I : \ -1 \le d_i \le 1 \ \wedge \ \begin{cases} x_i = l_i \Rightarrow d_i \ge 0 \\ x_i = r_i \Rightarrow d_i \le 0 \end{cases}$$

Now, Lemmas 2 and 3 applied to the subproblem $\mathcal{P}_{I,x_J}$ induced by $I$ and $x_J$
read as follows:

Corollary 2. 1. $C_{I,x_J}(x_I)$ is a continuous function on $R(\mathcal{P}_{I,x_J})$. Moreover,
   $C_{I,x_J}(x_I) \ge 0$ with equality iff $x_I$ is optimal for $\mathcal{P}_{I,x_J}$.
2. $\mathrm{opt}(\mathcal{D}_{I,x_J}[x_I]) \le 0$ with equality iff $x_I$ is optimal for
   $\mathcal{P}_{I,x_J}$.

The first statement in Corollary 2 can clearly be strengthened:

Remark 1. $C_I(x) = C_{I,x_J}(x_I)$ viewed as a function in $x$ is continuous
on $R(\mathcal{P})$.

This already settles conditions (C1) and (C2). Condition (C3) is settled by

Lemma 6. If $x$ is a feasible solution for $\mathcal{P}$ that is not optimal, then there exists
a working set $I \subseteq \{1,\ldots,m\}$ such that $|I| \le k + 1$ and
$\mathrm{opt}(\mathcal{D}_{I,x_J}[x_I]) < 0$.



Proof. In order to facilitate the proof, we first introduce two slight modifications
of problem $\mathcal{D}[x]$. Let $\mathcal{D}'[x]$ be the problem obtained from $\mathcal{D}[x]$ by
substituting the single constraint

$$\sum_{i=1}^m |d_i| = 1$$

for the $m$ constraints

$$\forall i = 1,\ldots,m : \ -1 \le d_i \le 1 .$$

Problems $\mathcal{D}[x]$ and $\mathcal{D}'[x]$ exhibit the following relationship:

- If $d \ne 0$ is a feasible solution for $\mathcal{D}[x]$ of cost $c$, then $d / \sum_{i=1}^m |d_i|$
  is a feasible solution for $\mathcal{D}'[x]$ of cost $c / \sum_{i=1}^m |d_i|$.

- Clearly, each feasible solution for $\mathcal{D}'[x]$ is also a feasible solution for
  $\mathcal{D}[x]$.

Thus, there is a feasible solution of negative cost for $\mathcal{D}[x]$ iff there is one for
$\mathcal{D}'[x]$. We may therefore conclude that $\mathrm{opt}(\mathcal{D}'[x]) < 0$ iff
$\mathrm{opt}(\mathcal{D}[x]) < 0$.

It will still be more convenient to consider another modification of $\mathcal{D}[x]$ that we
denote as $\mathcal{D}''[x]$ in the sequel:

$$\min_{d^+, d^-} \begin{pmatrix} \nabla f(x) \\ -\nabla f(x) \end{pmatrix}^{\!\top} \begin{pmatrix} d^+ \\ d^- \end{pmatrix}$$

subject to

$$\forall i = 1,\ldots,m : \ x_i = l_i \Rightarrow d_i^- = 0 , \quad x_i = r_i \Rightarrow d_i^+ = 0 \qquad (16)$$

$$[A, -A] \begin{pmatrix} d^+ \\ d^- \end{pmatrix} = 0 \qquad (17)$$

$$\sum_{i=1}^m \left( d_i^+ + d_i^- \right) = 1 \qquad (18)$$

$$d^+, d^- \ge 0 \qquad (19)$$

$\mathcal{D}''[x]$ and $\mathcal{D}'[x]$ are easily seen to be equivalent by making use of the
relation $d = d^+ - d^-$. Thus, $\mathrm{opt}(\mathcal{D}''[x]) < 0$ iff
$\mathrm{opt}(\mathcal{D}'[x]) < 0$. What is the advantage of dealing with $\mathcal{D}''[x]$ instead
of $\mathcal{D}[x]$? The answer is that $\mathcal{D}''[x]$ is a linear program in canonical form
with very few equations. To see this, note first that we need not count the equations in
(16) since the variables that are set to zero there can simply be eliminated (thereby
passing to a lower-dimensional problem). Thus, there are only $k + 1$ equations left in
(17) and (18). It follows that each basic feasible solution for $\mathcal{D}''[x]$ has at
most $k + 1$ non-zero components.

We are now prepared to prove the lemma. Assume that $x$ is a feasible but
suboptimal solution of $\mathcal{P}$. We may conclude from Lemma 3 that
$\mathrm{opt}(\mathcal{D}[x]) < 0$ and, therefore, $\mathrm{opt}(\mathcal{D}''[x]) < 0$. If $d^+, d^-$
represent the optimal basic feasible solution for $\mathcal{D}''[x]$ (with at most $k + 1$
non-zero components and negative cost), we obtain the feasible solution $d = d^+ - d^-$ for
$\mathcal{D}'[x]$ that also has at most $k + 1$ non-zero components and (the same) negative
cost. Consider working set $I = \{i \in \{1,\ldots,m\} \mid d_i \ne 0\}$. Clearly, $d_I$ is also a
feasible solution for $\mathcal{D}_{I,x_J}[x_I]$ such that
$\nabla f(x)_I^\top d_I = \nabla f(x)^\top d < 0$. Thus, $\mathrm{opt}(\mathcal{D}_{I,x_J}[x_I]) < 0$, which
completes the proof. □

Combining Lemma 6 with Corollary 2, we get

Corollary 3. If $x$ is a feasible but non-optimal solution for $\mathcal{P}$, then there exists
a working set $I$ of size at most $k + 1$ such that $x_I$ is not optimal for $\mathcal{P}_{I,x_J}$
and $C_I(x) = C_{I,x_J}(x_I) > 0$.

5 Final Remarks and Open Problems 

Chang, Hsu, and Lin prove the convergence for a decomposition method that 
is tailored to the optimization problem (1) except that the cost function may 
be an arbitrary continuously differentiable function [5]. They apply techniques 
of “projected gradients”. Although their analysis is tailored to problem (1), we 
would like to raise the question whether the techniques of projected gradients 
can be used to extend our results to a wider class of cost functions. 

The function $C(x)$ defined in (7) is easily seen to bound $f(x) - f(x^\infty)$ from
above. In this sense it measures (an upper bound on) the current distance from
the optimum. Schölkopf and Smola have proposed to select the working set $I$ whose
indices point to the (at most $q$) largest terms in $C(x)$ [23]. This policy for working
set selection looks similar to ours (but the policies are, in general, not identical).
The question whether the (somewhat simpler) policy proposed by Schölkopf and
Smola makes the sequence $x^s$ converge to an optimal limit point remains open
(although we cannot rule out that both policies actually coincide for the specific
problems resulting from SVM applications).

The most challenging task for future research is gaining a deeper understand- 
ing of the trade-off between the following three goals: 

- efficiency of working set selection 

- fast convergence to optimum 

- generality of the arguments 

It would be nice to lift the decomposition method from SVM applications to a 
wider class of optimization problems without much loss of efficiency or speed of 
convergence. 



Acknowledgments. Thanks to Dietrich Braess for pointing us to a simpli-

comments and suggestions and for drawing our attention to the random sampling

Abstract. This paper introduces a new method using dyadic decision
trees for estimating a classification or a regression function in a
multiclass classification problem. The estimator is based on model selection
by penalized empirical loss minimization. Our work consists of two
complementary parts: first, a theoretical analysis of the method leads to
oracle-type inequalities for three different possible loss functions;
second, we present an algorithm able to compute the estimator exactly.



1 General Setup 

1.1 Introduction 

In this paper we introduce a new method using dyadic decision trees for estimat- 
ing a classification or a regression function in a multiclass classification problem. 
The two main focuses of our work are a theoretical study of the statistical prop- 
erties of the estimator, and an exact algorithm used to compute it. 

The theoretical part (section 2) is centered around the convergence properties 
of piecewise constant estimators on abstract partition models (generalized his- 
tograms) for estimating either a classification function or the conditional proba- 
bility distribution (cpd) P{Y\X) for a classification problem. A suitable partition 
is selected by a penalized minimum empirical loss method and we derive oracle 
inequalities for different possible loss functions: for classification, we use the 0-1 
loss; for cpd estimation, we consider the minus-log loss, and the square error 
loss. These general results are then applied to dyadic decision trees. In section 3, 
we present an algorithm that computes exactly the solution of the
minimization problem that defines the estimator in this case.

1.2 Related Work and Novelty of Our Approach

The oracle-style bounds presented here for generalized histograms for multiclass
problems are novel to our knowledge. Our analysis relies heavily on [1], which
contains the fundamental tools used to prove Theorems 1-3. For classification,
Theorem 1 presents a bound for a penalty which is not inverse square-root in
the sample size (as is the case for example in classical VC theory for consistent
bounds, i.e. bounds that show convergence to the Bayes classifier of an SRM
procedure when sample size grows to infinity) but inverse linear, thus of strictly
lower order. This holds under an identifiability assumption of the maximum
class, akin to Tsybakov’s condition (see [2] and [3]). For cpd estimation, the result
of Theorem 3 seems entirely novel in that it states an oracle inequality with
the Kullback-Leibler (K-L) divergence on both sides. In contrast, related results
in [4,5] for density estimation had the Hellinger distance on the left-hand side.
Dyadic trees for density estimation have also been recently studied in [6] with a 
result for convergence in L^. 

Traditional CART-type algorithms [7] adopt a similar penalized loss ap- 
proach, but do not solve exactly the minimization problem. Instead, they grow 
a large tree in a greedy way, and prune it afterwards. Some statistical properties 
of this pruning procedure have been studied in [8] . More recently, an exact algo- 
rithm for dyadic trees and related theoretical analysis for classification loss has 
been proposed in [9,10]. It differs fundamentally from the algorithm presented 
here in that the directions of the splits are fixed in advance in the latter work, 
so that the procedure essentially reduces to a pruning. It is also different in 
that the authors do not make any identifiability assumption and therefore use a 
square-root type penalty (see discussion in section 2.3). 

On the algorithmic side, the novelty of our work lies in the fact that we
are able to treat the case of arbitrary direction choice for the splits in the tree.
This makes the estimators much more adaptive to the problem than
a fixed-directions architecture, particularly if the target function is
very anisotropic, e.g. if there are irrelevant input features.



1.3 Goals 

We consider a multiclass classification problem modeled by a couple of variables
$(X, Y) \in \mathcal{X} \times \mathcal{Y}$ with $\mathcal{X} = [0,1]^d$ and a finite class set
$\mathcal{Y} = \{1, \ldots, t\}$. We assume that we observe a training sample
$(X_i, Y_i)_{i=1,\ldots,n}$ of size $n$, drawn i.i.d. from some unknown probability $P(X, Y)$.
We are interested in estimating either a classification function or the cpd $P(Y|X)$.
Estimation of the cpd can be of practical interest in its own right or can be used to
form a derived classifier by “plug-in”. It is generally argued that such plug-in
estimates can be suboptimal and that one should directly try to estimate the classifier
if it is the final aim (see [11]). However, even if classification is the goal, there is
also some important added value in estimating $P(Y|X)$:

- it gives more information to the user than the classification function, allowing
  for a finer appreciation of ambiguous cases;

- it makes it possible to deal with cases where the classification loss is not the
  same for all classes. In particular, it is better suited when performance is
  measured by a ROC curve.

To qualitatively measure the fit of a function $f$ to a data point $(X, Y)$, a
loss function $\ell(f, X, Y) \in \mathbb{R}$ is used. The goal is to be as close as possible to
the function $f^*$ minimizing the average loss:

$$f^* = \mathop{\mathrm{Arg\,Min}}_{f \in \mathcal{F}} E[\ell(f, X, Y)] ,$$

where the minimum is taken over some suitable subset $\mathcal{F}$ of all measurable
functions. We consider several possible loss functions; these are detailed in
section 1.6.

If a function $\hat{f}$ is selected by some method using the training sample, it is
coherent to measure its closeness to $f^*$ by means of its excess (average) loss
(also called risk):

$$L(\ell, \hat{f}, f^*) = E[\ell(\hat{f}, X, Y)] - E[\ell(f^*, X, Y)] ;$$

our theoretical study is focused on this quantity.

1.4 Bin Estimation and Model Selection 

We focus on bin estimation, i.e. the estimation of the target function using a
piecewise constant function with a finite number of pieces, which can be seen
as a generalized histogram. Such a piecewise constant function $f$ is therefore
characterized by a finite measurable partition $\mathcal{B}$ of the input space $\mathcal{X}$
(each piece of the partition will hereafter be called a bin) and by the values $f_{b,y}$
taken on the bins for $b \in \mathcal{B}$, $y \in \mathcal{Y}$:

$$f(x, y) = \sum_{b \in \mathcal{B}} 1\{x \in b\} f_{b,y} \qquad (1)$$

Once a partition is fixed, it is natural to estimate the parameters $f_{b,y}$ using
the training sample points which are present in the bin: we therefore define the
following counters for all $b \in \mathcal{B}$, $y \in \mathcal{Y}$:

$$N_{b,y} = \sum_{i=1}^n 1\{X_i \in b,\ Y_i = y\} \quad \text{and} \quad N_b = \sum_{i=1}^n 1\{X_i \in b\} = \sum_{y \in \mathcal{Y}} N_{b,y}$$
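
The two counters can be computed in a single pass over the sample. The following is a minimal sketch; the function name, the representation of a partition as a function `bin_of` mapping a point to a bin identifier, and the toy sample are assumptions made for illustration.

```python
from collections import Counter


def bin_counts(sample, bin_of):
    """Count sample points per (bin, class) and per bin.

    sample: list of (x, y) pairs; bin_of: maps x to a hashable bin identifier.
    """
    n_by = Counter()   # N_{b,y}
    n_b = Counter()    # N_b
    for x, y in sample:
        b = bin_of(x)
        n_by[(b, y)] += 1
        n_b[b] += 1
    return n_by, n_b


# Hypothetical one-dimensional sample with the partition {[0, 0.5), [0.5, 1]}.
sample = [(0.1, 1), (0.2, 1), (0.4, 2), (0.7, 2), (0.9, 2)]
n_by, n_b = bin_counts(sample, lambda x: 0 if x < 0.5 else 1)
```

Summing `n_by` over the classes of a bin recovers `n_b` for that bin, which is the second identity in the display above.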

Of course, the crucial problem here is the choice of a suitable partition, which
is a problem of model selection. Hereafter, we identify a model with a partition:
an abstract model will be denoted by $m$, and the associated partition by
$\mathcal{B}_m$; $|m|$ denotes the number of pieces in $\mathcal{B}_m$. The set of piecewise
constant real functions on bins of $\mathcal{B}_m$ (i.e. of the form (1)) will be denoted
$\mathcal{G}_m$. Similarly, the set of classification functions which are piecewise constant
on $\mathcal{B}_m$ will be denoted $\mathcal{C}_m$. Finally, the set of piecewise constant
densities on $\mathcal{B}_m$ will be denoted $\mathcal{F}_m$:

$$\mathcal{F}_m = \left\{ f \in \mathcal{G}_m \ \middle| \ \forall x \in \mathcal{X},\ \sum_{y} f(x, y) = 1 \right\}$$

1.5 Dyadic Decision Trees

Our goal is to consider specific partition models generated by dyadic decision
trees. A dyadic decision tree is a binary tree structure $T$ such that each internal
node of $T$ is “colored” with an element of $\{1, \ldots, d\}$ (recall $d$ is the dimension of
$\mathcal{X} = [0,1]^d$). To each node (internal or terminal) of $T$ is then associated a certain
bin obtained by recursively splitting $[0,1]^d$ in half along the axes, according to the
colors at the internal nodes of $T$. This is defined formally in the following way:

1. To the root of $T$ is associated $[0,1]^d$.

2. Suppose $s$ is an internal node of $T$, and that a bin of the form
   $b(s) = \prod_{i=1}^d I_i$ is associated to $s$, where the $(I_i)$ are dyadic intervals on the
   different axes of $\mathcal{X}$. Let $k_s$ be the color of $s$; then the bins associated to
   the right and left children nodes $r_s, l_s$ of $s$ are obtained by cutting $b(s)$ at its
   midpoint perpendicular to axis $k_s$; in other words, $b(r_s)$ is obtained by replacing
   in the product defining $b(s)$ interval $I_{k_s}$ by its right half-interval, and
   correspondingly for $b(l_s)$.

Finally, the partition model generated by $T$ is the set of bins attached to the
terminal nodes (leaves) of $T$.
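
The recursive definition translates directly into code. In the sketch below, the tree encoding (a leaf is `None`, an internal node is a tuple of color, left child, and right child), the 0-based axis numbering, and the representation of a bin as a list of intervals are all illustrative assumptions.

```python
def leaf_bins(tree, box):
    """Return the bins attached to the leaves of a dyadic decision tree.

    tree: None for a leaf, or (k, left, right) where k is the split axis (0-based).
    box:  list of (low, high) intervals, one per axis, describing the node's bin.
    """
    if tree is None:
        return [box]
    k, left, right = tree
    low, high = box[k]
    mid = (low + high) / 2
    left_box = box[:k] + [(low, mid)] + box[k + 1:]    # left half-interval on axis k
    right_box = box[:k] + [(mid, high)] + box[k + 1:]  # right half-interval on axis k
    return leaf_bins(left, left_box) + leaf_bins(right, right_box)


# Hypothetical tree in dimension d = 2: split axis 0 at the root,
# then split the right child along axis 1.
tree = (0, None, (1, None, None))
bins = leaf_bins(tree, [(0.0, 1.0), (0.0, 1.0)])
```

The example produces three bins: the left half of the unit square, and the lower and upper quarters of its right half.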

1.6 Loss Functions 

We investigate three possible loss functions. For classification problems, we
consider the set of classifier functions
$\mathcal{F}_{\mathrm{class}} = \{f : \mathcal{X} \to \mathcal{Y}\}$ and the 0-1 loss:

$$\ell_{\mathrm{class}}(f, X, Y) = 1\{f(X) \ne Y\} \qquad (2)$$

The corresponding minimizer of the average loss among all functions from
$\mathcal{X}$ to $\mathcal{Y}$ is given by the Bayes classifier
$f^*_{\mathrm{class}}(x) = \mathop{\mathrm{Arg\,Max}}_{y \in \mathcal{Y}} P(Y = y | X = x)$
(see e.g. [11]).

For cpd estimation, we consider the set $\mathcal{F}_{\mathrm{cpd}}$ of functions which are
conditional probabilities of $Y$ given $X$, i.e. functions
$f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$ which are measurable and satisfy
$\sum_{y \in \mathcal{Y}} f(x, y) = 1$ for all $x \in \mathcal{X}$. In this case we use one of two
possible loss functions: the minus-log loss

$$\ell_{\log}(f, X, Y) = -\log(f(X, Y)) , \qquad (3)$$

(which can possibly take the value $+\infty$) and the square loss

$$\ell_{\mathrm{sq}}(f, X, Y) = (1 - f(X, Y))^2 + \sum_{j \ne Y} f(X, j)^2 = \left\| f(X, \cdot) - \overline{Y} \right\|^2 , \qquad (4)$$

where $\|\cdot\|$ is the standard Euclidean norm in $\mathbb{R}^t$ and $\overline{Y}$ is the $Y$-th
canonical base vector of $\mathbb{R}^t$. It is easy to check that the function minimizing the
average losses $E \ell_{\log}(f, X, Y)$ and $E \ell_{\mathrm{sq}}(f, X, Y)$ over
$\mathcal{F}_{\mathrm{cpd}}$ is indeed $f^*_{\mathrm{cpd}}(x, y) = P(Y = y | X = x)$. The corresponding
excess losses from $f$ to $f^*_{\mathrm{cpd}}$ are then given,
respectively, by the average K-L divergence given $X$:



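$$L(\ell_{\log}, f, f^*_{\mathrm{cpd}}) = E_P \left[ \log \left( \frac{P(Y|X)}{f(X, Y)} \right) \right] = KL(P, f | X) , \qquad (5)$$

and the averaged squared Euclidean distance in $\mathbb{R}^t$:

$$L(\ell_{\mathrm{sq}}, f, f^*_{\mathrm{cpd}}) = E_{P(X)} \left[ \left\| f(X, \cdot) - P(Y = \cdot | X) \right\|^2 \right] = \left\| f - f^*_{\mathrm{cpd}} \right\|^2_{t,2} . \qquad (6)$$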

Finally, we will make use of the following additional notation: $\ell(f)$ is a
shortcut for $\ell(f, \cdot, \cdot)$ as a function of $X$ and $Y$; we denote the expectation of a
function $f$ with respect to $P$ either by $E_P[f]$ or $Pf$; $P_n$ denotes the empirical
distribution associated to the sample.
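
The three losses of this section can be written down directly for a single data point. In the following sketch a classifier output is a class label and a cpd output is a dictionary mapping each class to its probability; this representation and the function names are assumptions made for illustration.

```python
import math


def loss_class(predicted, y):
    """0-1 loss of equation (2)."""
    return 0 if predicted == y else 1


def loss_log(probs, y):
    """Minus-log loss of equation (3); infinite when the observed class has probability 0."""
    p = probs[y]
    return math.inf if p == 0 else -math.log(p)


def loss_sq(probs, y):
    """Square loss of equation (4): squared distance to the y-th canonical base vector."""
    return sum((p - (1.0 if c == y else 0.0)) ** 2 for c, p in probs.items())


# Hypothetical predicted distribution over three classes.
probs = {1: 0.5, 2: 0.5, 3: 0.0}
```

With these values, `loss_sq(probs, 1)` is 0.5, `loss_log(probs, 1)` is log 2, and `loss_log(probs, 3)` is infinite, which illustrates why the minus-log loss needs special care for empty cells.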



2 Theoretical Results for the Bin Estimators 

2.1 Fixed Model m 

First let us assume that some fixed model $m$ is chosen. We now define an
estimator associated to this model and depending on the loss function used. The
classical empirical risk minimization method consists in considering the
empirical (or training) loss

$$P_n \ell(f) = \frac{1}{n} \sum_{i=1}^n \ell(f, X_i, Y_i) , \qquad (7)$$

and selecting the function attaining the minimum of this empirical loss over the
set of functions in the model. When using the classification loss, this gives
rise to the classifier minimizing the training error:

$$\hat{f}_m(x) = \sum_{b \in \mathcal{B}_m} 1\{x \in b\} \mathop{\mathrm{Arg\,Max}}_{y \in \mathcal{Y}} N_{b,y} ; \qquad (8)$$

when using the square loss or the minus-log loss (3), this gives rise to

$$\hat{f}_m(x, y) = \sum_{b \in \mathcal{B}_m} 1\{x \in b\} \frac{N_{b,y}}{N_b} . \qquad (9)$$

In case of an undefined ratio 0/0 in the formula above, one can choose arbitrary
values for this bin, say $1/t$ for all classes.
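
A minimal sketch of the two fixed-model estimators follows, assuming the usual reading of (8) as a per-bin majority vote and of (9) as per-bin empirical class frequencies. The input format (a dictionary from bins to per-class counts), the tie-breaking rule, and the function name are illustrative assumptions.

```python
def fit_bin_estimators(counts_by_bin, classes):
    """Return (classifier, cpd) dictionaries keyed by bin.

    counts_by_bin: dict bin -> dict class -> count (missing classes count as 0).
    Ties in the majority vote go to the class listed first in `classes`.
    Empty bins get the uniform distribution, as suggested for the 0/0 case.
    """
    classifier, cpd = {}, {}
    for b, counts in counts_by_bin.items():
        n_b = sum(counts.get(y, 0) for y in classes)
        classifier[b] = max(classes, key=lambda y: counts.get(y, 0))
        if n_b == 0:
            cpd[b] = {y: 1.0 / len(classes) for y in classes}
        else:
            cpd[b] = {y: counts.get(y, 0) / n_b for y in classes}
    return classifier, cpd


# Hypothetical counts for three bins and two classes; bin 2 is empty.
counts_by_bin = {0: {1: 2, 2: 1}, 1: {2: 2}, 2: {}}
classifier, cpd = fit_bin_estimators(counts_by_bin, classes=[1, 2])
```

Bin 1 illustrates the issue addressed next in the text: class 1 receives estimated probability zero there, so the minus-log loss is infinite if class 1 can actually occur in that bin.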

In the case of the minus-log loss, notice that the loss has infinite average
whenever there is a bin $b$ such that $N_{b,y} = 0$ but $P(Y = y | X \in b) > 0$. This
motivates considering the following slightly modified estimator, which bypasses
this problem:

7m= (10) 

where $\rho$ is some small positive constant. Typically, we can choose $\rho$ of order
$O(n^{-k})$ (see discussion after Theorem 3) for some arbitrary but fixed $k$ (to fix
ideas, say $k = 3$), so that the two functions will be very close in all cases.

2.2 Model Selection via Penalization

Now we address the problem of choosing a model $m$. A common approach is to
use a penalized empirical loss criterion, namely selecting the model $\hat{m}$ such that

$$\hat{m} = \mathop{\mathrm{Arg\,Min}}_{m \in \mathcal{M}} \left( P_n \ell(\hat{f}_m) + \mathrm{pen}(m) \right) , \qquad (11)$$

where pen is a suitable penalization function. For the standard CART algorithm,
the penalization is of order $\alpha |m|$. The goal of the theoretical study to come is
to justify that penalties of this order with estimators defined by (11) lead to
oracle-type bounds for the respective excess losses. Note that we must assume
that the exact minimization of (11) is found, or at least with a known error
margin, which typically is not the case for the greedy CART algorithm. We will
show in section 3 how the minimization can be solved effectively for dyadic trees.
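
The selection rule (11) can be illustrated with a brute-force search over a small, explicitly listed family of models. The sketch assumes a penalty of the form alpha times the number of bins, as mentioned for CART; the model names, loss values, and sizes are made up for illustration, and the exhaustive search is not the exact algorithm for dyadic trees announced for section 3.

```python
def select_model(empirical_loss, sizes, alpha):
    """Return the model minimizing empirical loss + alpha * |m|.

    empirical_loss: dict model -> training loss of the fitted estimator.
    sizes:          dict model -> number of bins |m|.
    """
    return min(empirical_loss, key=lambda m: empirical_loss[m] + alpha * sizes[m])


# Hypothetical family of three nested partitions.
empirical_loss = {"coarse": 0.30, "medium": 0.18, "fine": 0.15}
sizes = {"coarse": 1, "medium": 4, "fine": 16}
chosen = select_model(empirical_loss, sizes, alpha=0.01)
```

With no penalty the finest partition always wins on training loss; a moderate penalty selects the intermediate model, and a large one selects the coarsest.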

2.3 Oracle Inequalities for the Penalized Estimators 

Classification Loss. In the case of classification loss, it has been known for some time [2,3] that the best convergence rates in classification strongly depend on the behavior of $P(Y \mid X)$ and in particular on the identifiability of the majority class. Without any assumption in this regard, the minimax rate of convergence for classification error is of order $\sqrt{D/n}$ for a model of VC-dimension $D$ (see e.g. [11]), and thus the penalty should be at least of this order. Such an analysis has been used in [9] for dyadic classification trees. Presently, we will assume instead that we are in a favorable case in which the majority class is always identifiable^ with a fixed known "margin" $\eta_0$, which allows the use of a smaller-order penalty ($O(|m|/n)$). Moreover, this additive (wrt. the size of the model) penalty makes the minimization problem (11) easier to solve practically. Note that the identifiability assumption is only necessary for classifier estimation in Theorem 1, not for cpd estimation in Theorems 2-3.

Theorem 1. Assume the following identifiability condition: there exists some $\eta_0 > 0$ such that

$$\forall x \in \mathcal{X}, \quad P(Y = f^*_{class}(x) \mid X = x) - \max_{y \neq f^*_{class}(x)} P(Y = y \mid X = x) \ge \eta_0. \qquad (12)$$

Let $(x_m)_{m \in \mathcal{M}}$ be real numbers with $\sum_{m \in \mathcal{M}} \exp(-x_m) \le 1$. Then for any $K > 1$, there exist absolute constants $C_1, C_2, C_3$ such that, if

$$\forall m \in \mathcal{M}, \quad \mathrm{pen}(m) \ge C_1 \frac{[\ldots]}{\eta_0 n} + C_2 \frac{x_m}{\eta_0 n}, \qquad (13)$$

then the penalized estimator $\hat{f}_{\hat{m}}$ satisfies

$$E\left[ \mathrm{err}(\hat{f}_{\hat{m}}) - \mathrm{err}(f^*_{class}) \right] \le K \inf_{m \in \mathcal{M}} \left( \inf_{f \in \mathcal{C}_m} \left( \mathrm{err}(f) - \mathrm{err}(f^*_{class}) \right) + 2\,\mathrm{pen}(m) + \frac{C_3}{n} \right),$$

where err denotes the generalization error and the expectation on the left-hand side is over training sets.

^ Note that the identifiability assumption (12) is much weaker than the assumption that the Bayes error is zero, which appears in classical VC theory to justify non-square-root penalties for consistent bounds and SRM procedures.

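Square Loss

Theorem 2. Let $(x_m)_{m \in \mathcal{M}}$ be real numbers with $\sum_{m \in \mathcal{M}} \exp(-x_m) \le 1$. Then for any $K > 1$, there exist absolute constants $C_1, C_2, C_3$ such that, if

$$\forall m \in \mathcal{M}, \quad \mathrm{pen}(m) \ge C_1 \frac{[\ldots]}{n} + C_2 \frac{[\ldots]}{n}, \qquad (14)$$

then the penalized estimator $\hat{f}_{\hat{m}}$ satisfies

$$E\left[ \left\| \hat{f}_{\hat{m}} - f^*_{cpd} \right\|^2 \right] \le K \inf_{m \in \mathcal{M}} \left( \inf_{f \in \mathcal{F}_m} \left\| f - f^*_{cpd} \right\|^2 + 2\,\mathrm{pen}(m) + \frac{C_3}{n} \right).$$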


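Minus-log Loss

Theorem 3. Let $(x_m)_{m \in \mathcal{M}}$ be real numbers with $\sum_{m \in \mathcal{M}} \exp(-x_m) \le 1$. Then for any $K > 1$, there exist absolute constants $C_1, C_2, C_3$ such that, if

$$\forall m \in \mathcal{M}, \quad \mathrm{pen}(m) \ge C_1 \frac{[\ldots]}{n} + C_2 \frac{[\ldots]}{n}, \qquad (15)$$

then the penalized estimator $\hat{f}^{\rho}$ satisfies

$$E\left[ KL(P, \hat{f}^{\rho} \mid X) \right] \le K \inf_{m \in \mathcal{M}} \left( \inf_{f \in \mathcal{F}_m} KL(P, f \mid X) + 2\,\mathrm{pen}(m) + \frac{C_3}{n} - 3 \log(1 - t\rho) \right).$$

Note that the typical values of $\rho$ should be of order $n^{-k}$ for some arbitrary $k > 0$. Assuming the number of models per dimension is at most exponential, the penalty function is then of order $t|m| \log n / n$, and the trailing term $\log(1 - t\rho)$ is of order $t/n^k$.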



Application to Dyadic Decision Trees

Corollary 1. For dyadic decision trees in dimension $d$, Theorems 1-3 apply with the choice

$$x_m = C|m| \log(d), \qquad (16)$$

where $C$ is a universal constant.

Proof. The point here is only to count the number of models of size $|m| = D$. An upper bound can be obtained in the following way: the number of binary trees with $D + 1$ leaves is given by the Catalan number $Cat(D) = (D + 1)^{-1} \binom{2D}{D}$; such a tree has $D$ internal nodes and we can therefore label these nodes in $d^D$ different ways. It can be shown that $Cat(n) \le C' 4^n / n^{3/2}$ for some constant $C'$; hence for $C$ big enough in (16), $\sum_{m \in \mathcal{M}} \exp(-x_m) \le 1$ is satisfied. □
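The counting argument can be checked numerically with a short sketch. It assumes that the number of models of size D is bounded by Cat(D) · d^D and that the weights are exp(-C · D · log d); the function names and the values d = 2, C = 4 are illustrative assumptions, not constants from the paper.

```python
from math import comb, exp, log

def catalan(D):
    """Catalan number: count of binary trees with D internal nodes (D + 1 leaves)."""
    return comb(2 * D, D) // (D + 1)

def weighted_model_count(d, C, max_D=200):
    """Partial sum over D = 1..max_D of Cat(D) * d**D * exp(-C * D * log(d)).

    The terms are evaluated in the log domain to avoid overflow; d >= 2 is
    assumed, since log(d) = 0 for d = 1 and the weights would not decay.
    """
    return sum(exp(log(catalan(D)) + (1.0 - C) * D * log(d))
               for D in range(1, max_D + 1))

# Growth of the Catalan numbers: Cat(n) <= 4**n / n**1.5 on the tested range.
growth_ok = all(catalan(n) * n ** 1.5 <= 4 ** n for n in range(1, 50))

# With d = 2 and C = 4 each term is at most 4**D * 2**(-3 * D) = 2**(-D).
total = weighted_model_count(d=2, C=4.0)
```

Because each term is dominated by 2^(-D) for this choice, the partial sums stay below 1, which is the summability condition required of the weights.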
3 Implementation of the Estimator

Principle and naive approach. We hereafter assume that the penalization function is of the form $\mathrm{pen}(m) = a|m|$ for some $a$ (possibly depending on the sample size $n$).

In traditional CART, no exact minimization is performed. The split at each node is determined in a greedy way in order to yield the best local reduction of some empirical criterion (the entropy criterion corresponds to $\ell_{\log}$ while the Gini criterion corresponds to $\ell_{sq}$). In contrast, we introduce a method to find the global solution of (11) for dyadic decision trees by dynamic programming. This method is strongly inspired by an algorithm proposed by Donoho [12] for image compression.

We assume that there is a fixed bound $k_{\max}$ on the maximal number of cuts along a same dimension. Therefore, the smallest possible bins are those obtained with $k_{\max}$ cuts in every dimension, i.e. small hypercubes of edge length $2^{-k_{\max}}$. We represent any achievable bin by a $d$-tuple $b = (L_1(b), \ldots, L_d(b))$, where for each $i$, $L_i(b)$ is a finite list of length $0 \le |L_i(b)| \le k_{\max}$, with elements in $\{\ell, r\}$. Each of these (possibly empty) lists contains the succession of cuts in the corresponding dimension needed to obtain the bin; each element of the list indicates if the left or the right child is selected after a cut, see section 1.5. Note that, while the order of the sequence of cuts along a same dimension is important, the order in which the cuts along different dimensions are performed is not relevant for the definition of the bin. Finally, we will denote $|b| = \sum_{i} |L_i(b)|$ and call it the depth of cell $b$, and $\mathcal{B}_{k_{\max}}$ the set of achievable bins, i.e. such that $|L_i(b)| \le k_{\max}$ for all $1 \le i \le d$.

The principle of the method is simple, and is based on the additive property of the function to be optimized. If $b$ is a bin, denote $T_b$ a "local" dyadic tree rooted in $b$, i.e. a dyadic tree starting at bin $b$ and splitting it recursively, while still satisfying the assumption that the bins attached to its leaves belong to $\mathcal{B}_{k_{\max}}$. Furthermore we assume that to each terminal bin a value is associated, estimated from the data, such as (10), so that $T_b$ can be considered as a piecewise constant function on $b$. Denote $|T_b|$ the number of leaves of $T_b$ and define

$$\mathcal{E}(T_b) = \sum_{i : X_i \in b} \ell(T_b, X_i, Y_i) + n a |T_b|.$$

Note that when $b = [0,1]^d$, finding the minimum of $\mathcal{E}(T_b)$ is equivalent to the minimization problem (11). Moreover, whenever $T_b$ is not reduced to its root (hereafter we will call such a tree nondegenerate), if we denote $u$ and $v$ the bins attached to the left and right children of the root and $T_u$, $T_v$ the corresponding subtrees, then we have

$$\mathcal{E}(T_b) = \mathcal{E}(T_u) + \mathcal{E}(T_v).$$

For a bin $b$, let $T^*_b$ denote the local dyadic tree minimizing $\mathcal{E}(T_b)$. Finally, let us denote by $b^i_\ell$, $b^i_r$ the left and right sub-bins obtained by splitting $b$ in half along direction $i$. Then from the above observations it is straightforward that

$$\mathcal{E}(T^*_b) = \min\left\{ \mathcal{E}(R_b),\ \min_{i : |L_i(b)| < k_{\max}} \left( \mathcal{E}(T^*_{b^i_\ell}) + \mathcal{E}(T^*_{b^i_r}) \right) \right\}, \qquad (17)$$

where $R_b$ denotes the degenerate local tree $\{b\}$.

From this it is quite simple to develop the following naive bottom-up approach to solving the optimization (11): suppose we know the optimal local tree for every bin of depth $|b| = k$; then, using (17), we can compute the optimal local trees for all bins at depth $k - 1$. Starting with the deepest bins (the hypercubes of side length $2^{-k_{\max}}$), for which the optimal local trees are degenerate, it is possible to compute recursively optimal trees for lower-depth bins, finally finding the optimal tree $T^*$ for $[0,1]^d$.
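The recursion (17) can be sketched in a few lines. The version below is a top-down memoized variant with the classification loss, not the authors' bottom-up procedure; the function name `optimal_dyadic_tree`, the encoding of cuts as 0 (left) and 1 (right), and the tuple representation of trees are illustrative assumptions.

```python
from collections import Counter
from functools import lru_cache

def optimal_dyadic_tree(points, labels, k_max, a):
    """Minimize (number of training errors) + n * a * (number of leaves)
    over dyadic trees on [0, 1]^d with at most k_max cuts per dimension."""
    n, d = len(points), len(points[0])

    def codes(x):
        # First k_max binary digits of each coordinate: 0 = left half, 1 = right half.
        out = []
        for xi in x:
            bits = []
            for _ in range(k_max):
                xi *= 2
                bit = int(xi >= 1.0)
                bits.append(bit)
                xi -= bit
            out.append(tuple(bits))
        return tuple(out)

    coded = [codes(x) for x in points]

    def members(b):
        # Indices of the training points falling in bin b (prefix match per dimension).
        return [i for i, c in enumerate(coded)
                if all(c[j][:len(b[j])] == b[j] for j in range(d))]

    @lru_cache(maxsize=None)          # each bin is solved once, whatever the cut order
    def best(b):
        idx = members(b)
        counts = Counter(labels[i] for i in idx)
        leaf_cost = len(idx) - max(counts.values(), default=0) + n * a
        result = (leaf_cost, ("leaf", b))
        if not idx:
            return result             # empty bin: the degenerate tree is optimal
        for j in range(d):
            if len(b[j]) < k_max:
                left = b[:j] + (b[j] + (0,),) + b[j + 1:]
                right = b[:j] + (b[j] + (1,),) + b[j + 1:]
                cost_l, tree_l = best(left)
                cost_r, tree_r = best(right)
                if cost_l + cost_r < result[0]:
                    result = (cost_l + cost_r, ("split", j, tree_l, tree_r))
        return result

    return best(tuple(() for _ in range(d)))

# XOR-like toy data: one cut per dimension separates the classes.
pts = [(0.25, 0.25), (0.75, 0.75), (0.25, 0.75), (0.75, 0.25)]
labs = [0, 0, 1, 1]
cost, tree = optimal_dyadic_tree(pts, labs, k_max=1, a=0.1)
```

Only nonempty bins are expanded, and memoization ensures that a bin reachable through several orders of cuts is solved once. The sketch recomputes bin membership by scanning all points, so it is meant for illustration rather than efficiency.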



Dictionary-based approach. The naive approach proposed above however has a significant drawback, namely its complexity; there are already $2^{dk_{\max}}$ smallest bins at depth $dk_{\max}$, and even more bins for intermediate depth values, due to the combinatorics in the choice of cuts. We therefore put forward an improved approach, based on the following observation: if $2^{dk_{\max}} > n$, then some (possibly a lot) of the smallest bins are actually empty, and so are bins at intermediate depths as well. Furthermore, for an empty bin $b$ at any depth the optimal local tree is obviously the degenerate tree $T^*_b = R_b$. Therefore, it is sufficient to keep track of the non-empty bins along the process. This can be done using a dictionary $\mathcal{D}_k$ of non-empty bins of depth $k$; the algorithm is then as follows:



Initialization: construct dictionary $\mathcal{D}_{dk_{\max}}$ by finding the minimal bins (hypercubes of edge length $2^{-k_{\max}}$) containing at least one datapoint, and inserting them in $\mathcal{D}_{dk_{\max}}$. For each of these bins $b$, also store that $T^*_b = R_b$.
Loop on depth, $D = dk_{\max}, \ldots, 1$:
    Initialize $\mathcal{D}_{D-1} = \emptyset$.
    Loop on elements $b \in \mathcal{D}_D$:
        Loop on dimensions $k \in \{1, \ldots, d\}$ such that $|L_k(b)| > 0$:
            Let $b'$ denote the sibling of $b$ along dimension $k$, i.e. the bin obtained from $b$ by flipping the last element of $L_k(b)$. Let $u$ denote the direct common ancestor-bin of $b$ and $b'$.
            If $u$ is already stored in $\mathcal{D}_{D-1}$ with a (provisional) $T^*_u$, then replace
                $T^*_u \leftarrow \operatorname{Arg\,Min}\left( \mathcal{E}(T^*_u),\ \mathcal{E}(T^*_b) + \mathcal{E}(T^*_{b'}) \right)$.
            If $u$ is not yet stored in $\mathcal{D}_{D-1}$, store it along with the provisional
                $T^*_u \leftarrow \operatorname{Arg\,Min}\left( \mathcal{E}(R_u),\ \mathcal{E}(T^*_b) + \mathcal{E}(T^*_{b'}) \right)$.
        Endloop on k
    Endloop on b
Endloop on D

It is straightforward to prove that at the end of each loop over $b$, $\mathcal{D}_{D-1}$ contains all nonempty bins of depth $D - 1$ with the corresponding optimal local trees. Therefore at the end of the procedure $\mathcal{D}_0$ contains the tree minimizing the optimization problem (11).

We now give a result about the complexity of our procedure:

Proposition 1. For fixed training sample size $n \ge 1$, input dimension $d \ge 1$, and maximum number of splits along each dimension $k_{\max} \ge 1$, the complexity $C(n, d, k_{\max})$ of the dictionary-based algorithm satisfies

$$O\left( d k_{\max}^d \right) \le C(n, d, k_{\max}) \le O\left( n d k_{\max}^d \log(n k_{\max}^d) \right). \qquad (18)$$

Proof. For a given training point $(X_i, Y_i)$, the exact number of bins (at any depth) that contain this point is $(k_{\max} + 1)^d$. Namely, there is a unique bin $b_0$ of maximal depth $dk_{\max}$ containing $(X_i, Y_i)$; then, any other bin $b$ containing this point must be an "ancestor" of $b_0$ in the sense that for all $1 \le k \le d$, $L_k(b)$ must be a prefix list of $L_k(b_0)$. Bin $b$ is uniquely determined by the lengths of the prefix lists $|L_k(b)|$, $1 \le k \le d$; for each length there are $(k_{\max} + 1)$ possible choices, hence the result.
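The count of bins containing a given point can be verified by enumeration. In this sketch a bin is a tuple of per-dimension cut lists, and the ancestors of a deepest bin are obtained by choosing one prefix in each dimension; the function name `ancestor_bins` and the sample bin are illustrative.

```python
from itertools import product

def ancestor_bins(deepest_bin):
    """All bins containing the deepest bin: one prefix of each per-dimension cut list."""
    prefix_choices = [[cuts[:length] for length in range(len(cuts) + 1)]
                      for cuts in deepest_bin]
    return list(product(*prefix_choices))

k_max, d = 3, 2
b0 = (("l", "r", "r"), ("r", "l", "l"))   # k_max cuts in each of the d dimensions
bins = ancestor_bins(b0)                   # includes b0 itself and the root bin
```

Each dimension contributes k_max + 1 possible prefix lengths (0 through k_max), so the enumeration returns (k_max + 1)^d distinct bins.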

Since the algorithm must loop at least through all of these bins, and makes an additional loop on dimension for each bin, this gives the lower bound. For the upper bound, we bound the total number of bins for all training points by $O(n k_{\max}^d)$. Note that we can implement a dictionary $\mathcal{D}$ such that search and insert operations are of complexity $O(\log(|\mathcal{D}|))$ (for example an AVL tree, [13]). Coarsely upper-bounding the size of the dictionaries used by the total number of bins, we get the announced upper bound. □

Retaining $n k_{\max}^d$ as the leading factor of the upper bound, we see that the complexity of the dictionary-based algorithm is still exponential in the dimension $d$. To fix ideas, assume that we choose $k_{\max}$ so that the projection of the training set on any coordinate axis is totally separated by the regular grid of size $2^{-k_{\max}}$. If the distribution of $X$ has a bounded density wrt. Lebesgue measure, $k_{\max}$ should be of order $\log(n)$ and the complexity of the algorithm of order $n \log^d(n)$ (in the sense of logarithmic equivalence). Although it is much better than looping through every possible bin (which gives rise to a complexity of order $2^{dk_{\max}} \approx n^d$), it means that the algorithm will only be viable for low-dimensional problems, or by imposing restrictions on $k_{\max}$ for moderate-dimensional problems. Note however that other existing algorithms for dyadic decision trees [9,10,6] are all of complexity $2^{dk_{\max}}$, but that the authors choose $k_{\max}$ of the order of $d^{-1} \log n$. This makes sense in [10], because the cuts are fixed in advance and the algorithm is not adaptive to anisotropy. However, in [6] the author notices that $k_{\max}$ should be chosen as large as the computational complexity permits to take full advantage of the anisotropy adaptivity.

4 Discussion and Future Directions 

The two main points of our work are a theoretical study of the estimator and a practical algorithm. On the theoretical side, Theorems 1-2 are "true" oracle inequalities in the sense that the convergence rate for each of the models considered is of the order of the minimax rate (for a study of minimax rates for classification on finite VC-dimension models under the identifiability condition (12), see [3]). Theorem 3 misses the minimax rate, which is known to be of order $O(|m|/n)$, by a logarithmic factor. We do not know at this point if this factor can be removed. Another interesting future direction is to derive from these inequalities convergence rates for anisotropic regularity function classes, similarly to what was done in [6,12].

On the algorithmic side, our algorithm is arguably only viable for low- or moderate-dimensional problems (we tested it on 10-dimensional datasets). For application to high-dimensional problems, some partly-greedy heuristic appears to be an interesting strategy, for example splitting the problem into several lower-dimensional problems on which we can run the exact algorithm. We are currently investigating this direction.



Acknowledgments. The authors want to thank Lucien Birgé and Klaus-

A Proofs of Theorems 1-3

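The proofs for our results are based on a general model selection theorem appearing in [14], which is a generalization of an original theorem of Massart [1]. We quote it here in a slightly modified and shortened form tailored for our needs (see also [15] for a similar form of the theorem).

Theorem 4. Let $\ell(\cdot, \cdot)$ be a loss function defined on $\mathcal{S} \times \mathcal{X}$; denote $f^* = \operatorname*{Arg\,Min}_{f \in \mathcal{S}} P\ell(f)$ and $L(f, f^*) = P\ell(f) - P\ell(f^*)$. Let $(\mathcal{S}_m)_{m \in \mathcal{M}}$, $\mathcal{S}_m \subset \mathcal{S}$, be a countable collection of classes of functions and assume that there exist

- a pseudo-distance $d$ on $\mathcal{S}$;
- a sequence of sub-root^ functions $(\phi_m)$, $m \in \mathcal{M}$;
- two positive constants $b$ and $R$;

such that

(H1) $\forall f \in \mathcal{S},\ \forall x \in \mathcal{X}, \quad |\ell(f, x)| \le b$;

(H2) $\forall f, f' \in \mathcal{S}, \quad \mathrm{Var}_P\left[ \ell(f) - \ell(f') \right] \le d^2(f, f')$;

(H3) $\forall f \in \mathcal{S}, \quad d^2(f, f^*) \le R\, L(f, f^*)$;

and, if $r^*_m$ denotes the solution of $\phi_m(r) = r/R$,

(H4) $\forall m \in \mathcal{M},\ \forall f_0 \in \mathcal{S}_m,\ \forall r \ge r^*_m$:

$$E\left[ \sup_{f \in \mathcal{S}_m,\ d^2(f, f_0) \le r} (P - P_n)\left( \ell(f) - \ell(f_0) \right) \right] \le \phi_m(r).$$

Let $(x_m)_{m \in \mathcal{M}}$ be real numbers with $\sum_{m \in \mathcal{M}} \exp(-x_m) \le 1$. Let $\varepsilon \ge 0$ and let $\tilde{f}$ denote an $\varepsilon$-approximate penalized minimum loss estimator over the family $(\mathcal{S}_m)$ with the penalty function $\mathrm{pen}(m)$, that is, such that there exists $\tilde{m}$ with $\tilde{f} \in \mathcal{S}_{\tilde{m}}$ and

$$P_n \ell(\tilde{f}) + \mathrm{pen}(\tilde{m}) \le \inf_{m \in \mathcal{M}} \inf_{f \in \mathcal{S}_m} \left( P_n \ell(f) + \mathrm{pen}(m) + \varepsilon \right).$$

Given $K > 1$, there exist constants $C_1, C_2, C_3$ (depending on $K$ only) such that, if the penalty function $\mathrm{pen}(m)$ satisfies for each $m \in \mathcal{M}$:

$$\mathrm{pen}(m) \ge C_1 \frac{r^*_m}{R} + C_2 \frac{(R + b)\, x_m}{n},$$

then the following inequality holds:

$$E\, L(\tilde{f}, f^*) \le K \inf_{m \in \mathcal{M}} \left( \inf_{f \in \mathcal{S}_m} L(f, f^*) + 2\,\mathrm{pen}(m) + \frac{C_3}{n} + \varepsilon \right).$$

^ A function $\phi$ on $\mathbb{R}_+$ is sub-root if it is positive, nondecreasing and $\phi(r)/\sqrt{r}$ is nonincreasing for $r > 0$.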

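Proof outline for Theorem 1. We will apply Theorem 4 to the set of models $(\mathcal{C}_m)$. Checking hypothesis (H1) is obvious. To check (H2)-(H3), we choose the distance $d^2(f, g) = E\left[ (\ell_{class}(f, X, Y) - \ell_{class}(g, X, Y))^2 \right]$, so that (H2) is trivially satisfied. To check (H3), denote $\eta(x, i) = P(Y = i \mid X = x)$ and $\eta^*(x) = \max_{i \in \mathcal{Y}} \eta(x, i)$; we then have

$$E\left[ \mathbb{1}\{f(X) \neq Y\} - \mathbb{1}\{f^*(X) \neq Y\} \right] = E\left[ \left( \eta^*(X) - \eta(X, f(X)) \right) \mathbb{1}\{f(X) \neq f^*(X)\} \right] \ge \eta_0\, E\left[ \mathbb{1}\{f(X) \neq f^*(X)\} \right],$$

where we have used hypothesis (12). On the other hand,

$$E\left[ \left( \mathbb{1}\{f(X) \neq Y\} - \mathbb{1}\{f^*(X) \neq Y\} \right)^2 \right] = E\left[ \left( \eta^*(X) + \eta(X, f(X)) \right) \mathbb{1}\{f(X) \neq f^*(X)\} \right] \le 2 E\left[ \mathbb{1}\{f(X) \neq f^*(X)\} \right],$$

which proves that (H3) is satisfied with $R = 2/\eta_0$. Finally, for hypothesis (H4), we can follow the same reasoning as in [1], p. 294-295; in this reference the empirical shattering coefficient is taken into account, but the present case is even simpler since model $\mathcal{C}_m$ is finite with cardinality […], leading to

$$E\left[ \sup_{f \in \mathcal{C}_m,\ d^2(f, f_0) \le r} (P - P_n)\left( \ell_{class}(f) - \ell_{class}(f_0) \right) \right] \le C\, [\ldots]$$

for some universal constant $C$. This leads to the conclusion. □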

Proof outline for Theorem 2. We apply Theorem 4 to the set of models $(\mathcal{F}_m)$. For (H1), it is easy to check that

$$\forall f \in \mathcal{F}_{cpd}, \quad \ell_{sq}(f, X, Y) = \left\| f(X, \cdot) - \overline{Y} \right\|^2 = \left\| f(X, \cdot) \right\|^2 + 1 - 2 f(X, Y) \le 2.$$

For (H2), we note that $\ell_{sq}(f, X, Y) - \ell_{sq}(g, X, Y) = \|f(X, \cdot)\|^2 - \|g(X, \cdot)\|^2 - 2(f(X, Y) - g(X, Y))$. Using the equality $\mathrm{Var}[Z] = E[\mathrm{Var}[Z \mid X]] + \mathrm{Var}[E[Z \mid X]]$, we deduce that

$$\mathrm{Var}\left[ \ell_{sq}(f, X, Y) - \ell_{sq}(g, X, Y) \right]$$
$$= E\left[ \mathrm{Var}\left[ 2(f(X, Y) - g(X, Y)) \mid X \right] \right] + \mathrm{Var}\left[ \|f(X, \cdot)\|^2 - \|g(X, \cdot)\|^2 \right]$$
$$\le 4 E\left[ (f - g)^2 \right] + E\left[ \|f(X, \cdot) - g(X, \cdot)\|^2\, \|f(X, \cdot) + g(X, \cdot)\|^2 \right]$$
$$\le 8 E\left[ \|f(X, \cdot) - g(X, \cdot)\|^2 \right] = d^2(f, g);$$

this proves that (H2) is satisfied for the above choice of $d$; recalling (6), (H3) is then satisfied with R = 1/8. Finally, for hypothesis (H4) it is possible to show that

$$E\left[ \sup_{f \in \mathcal{F}_m,\ d^2(f, f_0) \le r} (P - P_n)\left( \ell_{sq}(f) - \ell_{sq}(f_0) \right) \right] \le [\ldots]$$

using local Rademacher and Gaussian complexities, using a method similar to [14]. □

Proof of Theorem 3. To apply Theorem 4, we define the ambient space

$$\mathcal{S}^{\rho} = \left\{ f \in \mathcal{F}_{cpd} \mid \forall (x, y) \in \mathcal{X} \times \mathcal{Y},\ f(x, y) \ge \rho \right\}$$

and the models as $\mathcal{S}^{\rho}_m = \mathcal{S}^{\rho} \cap \mathcal{F}_m$, which will ensure boundedness of the loss. As a counterpart of using these restricted ambient space and models, the application of Theorem 4 will result in an inequality involving not $f^*_{cpd}$ but the minimizer of the average loss on $\mathcal{S}^{\rho}$, denoted $f^*_{\rho}$, and the model-wise minimizers of the loss on $\mathcal{S}^{\rho}_m$ instead of $\mathcal{F}_m$. However, it is easy to show the following inequalities:

$$\forall f \in \mathcal{F}_{cpd}, \quad L(f, f^*_{cpd}) \le L(f, f^*_{\rho}) - \log(1 - t\rho);$$

$$\forall m \in \mathcal{M}, \quad \inf_{f \in \mathcal{S}^{\rho}_m} L(f, f^*_{\rho}) \le \inf_{f \in \mathcal{F}_m} L(f, f^*_{cpd}) - \log(1 - t\rho);$$

finally, it can be shown that $\hat{f}^{\rho}$ is a $-\log(1 - t\rho)$-approximate penalized estimator. Therefore, if Theorem 4 applies, these inequalities lead to the conclusion of Theorem 3.

We now turn to verifying the main assumptions of the abstract model selection theorem.

• Check for (H1): boundedness of the loss on the models. Obviously, we have

$$\forall f \in \mathcal{S}^{\rho},\ \forall (x, y) \in \mathcal{X} \times \mathcal{Y}, \quad 0 \le \ell_{\log}(f, x, y) \le -\log \rho.$$

• Check for (H2)-(H3): distance linking the risk and its variance. We choose the distance $d$ as the $L^2(P)$ distance between logarithms of the functions:

$$d^2(f, g) = E_P\left[ \left( \ell_{\log}(f, X, Y) - \ell_{\log}(g, X, Y) \right)^2 \right] = E_P\left[ \log^2 \frac{f}{g} \right].$$

Obviously we have $\mathrm{Var}\left[ \ell_{\log}(f, X, Y) - \ell_{\log}(g, X, Y) \right] \le d^2(f, g)$ with this choice; the problem is then to compare $E_P\left[ \log^2 \frac{P(Y \mid X)}{f} \right]$ to $E_P\left[ \log \frac{P(Y \mid X)}{f} \right]$. Denoting $Z(x, i) = f(x, i) / P(Y = i \mid X = x)$, we therefore have to compare $E[\log^2 Z]$ to $E[-\log Z]$ with the expectation taken wrt. $P$, so that $E[Z] = 1$. Note that $Z \ge \rho$. Using Lemma 1 below, we deduce that

$$d^2(P(Y \mid X), f) \le \frac{\log^2 \rho}{\rho - 1 - \log \rho}\, KL(P, f \mid X).$$

Note that typically when $\rho$ is small the factor $R$ in (H3) is therefore of order $-\log \rho$.

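• Check for (H4): $d$-local risk control on models. For any $f, g \in \mathcal{S}^{\rho}_m$, $F = \log \frac{f}{g} \in \mathcal{G}_m$. For $A \in \mathcal{B}_m$, $i \in \mathcal{Y}$, denote $P_{A,i} = P[X \in A, Y = i]$ and

$$\varphi_{A,i}(x, y) = \frac{\mathbb{1}\{x \in A\}\, \mathbb{1}\{y = i\}}{\sqrt{P_{A,i}}};$$

note that the family $(\varphi_{A,i})_{A,i}$ is an orthonormal basis (for the $L^2(P)$ structure) of $\mathcal{G}_m$, hence any function $F \in \mathcal{G}_m$ can be written under the form

$$F = \sum_{A,i} \alpha_{A,i}\, \varphi_{A,i},$$

with $P F^2 = \sum_{A,i} \alpha_{A,i}^2$. Putting $\nu_n = (P - P_n)$, we then have for any $f \in \mathcal{S}^{\rho}_m$

$$E_P\left[ \sup_{g \in \mathcal{S}^{\rho}_m,\ d^2(f, g) \le r} \left| \nu_n\left( \ell(f, \cdot) - \ell(g, \cdot) \right) \right| \right] \le E_P\left[ \sup_{F \in \mathcal{G}_m,\ E[F^2] \le r} \left| \nu_n F \right| \right]$$
$$= E_P\left[ \sup_{\sum_{A,i} \alpha_{A,i}^2 \le r} \left| \sum_{A,i} \alpha_{A,i}\, \nu_n \varphi_{A,i} \right| \right]$$
$$\le \sqrt{r}\, E_P\left[ \left( \sum_{A,i} \left( \nu_n \varphi_{A,i} \right)^2 \right)^{1/2} \right]$$
$$\le \sqrt{r} \left( E_P\left[ \sum_{A,i} \left( \nu_n \varphi_{A,i} \right)^2 \right] \right)^{1/2}$$
$$= \sqrt{r} \left( \frac{1}{n} \sum_{A,i} \frac{P_{A,i}(1 - P_{A,i})}{P_{A,i}} \right)^{1/2}$$
$$\le [\ldots]$$

□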

The following Lemma is inspired by similar techniques appearing in [4,16].

Lemma 1. Let $Z$ be a real, positive random variable such that $E[Z] = 1$ and $Z \ge \eta$ a.s. Then the following inequality holds:

$$\frac{E\left[ \log^2 Z \right]}{E\left[ -\log Z \right]} \le \frac{\log^2 \eta}{\eta - 1 - \log \eta}.$$

Proof. Let $u = -\log Z \le -\log \eta$; we have

$$E[-\log Z] = E[u] = E\left[ e^{-u} - 1 + u \right] = E\left[ u^2\, \frac{e^{-u} - 1 + u}{u^2} \right] \ge E\left[ u^2 \right] \frac{\eta - 1 - \log \eta}{\log^2 \eta},$$

where the first line comes from the fact that $E[e^{-u}] = E[Z] = 1$, and the last inequality from the fact that the function $g(x) = x^{-2}(e^{-x} - 1 + x)$ is positive and decreasing on $\mathbb{R}$. □
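The inequality of Lemma 1 can be checked numerically on a small discrete example. The sketch below takes a two-point random variable with mean 1 and compares the ratio E[log² Z] / E[-log Z] with the bound log² η / (η - 1 - log η), where η is the smallest value of Z; the function name `lemma1_sides` and the test distributions are illustrative choices.

```python
from math import log

def lemma1_sides(values, probs):
    """Return (ratio, bound) for a discrete positive Z with E[Z] = 1.

    ratio = E[log^2 Z] / E[-log Z]
    bound = log^2(eta) / (eta - 1 - log(eta)), with eta = min of the values.
    Requires eta < 1 (Z not identically 1), otherwise both sides are 0/0.
    """
    mean = sum(p * z for z, p in zip(values, probs))
    assert abs(mean - 1.0) < 1e-12, "Z must have mean 1"
    eta = min(values)
    num = sum(p * log(z) ** 2 for z, p in zip(values, probs))
    den = sum(-p * log(z) for z, p in zip(values, probs))
    bound = log(eta) ** 2 / (eta - 1.0 - log(eta))
    return num / den, bound

ratio, bound = lemma1_sides([0.5, 1.5], [0.5, 0.5])
ratio_small, bound_small = lemma1_sides([0.1, 1.9], [0.5, 0.5])
```

In both cases the ratio stays below the bound, and the bound grows as η decreases, in line with the remark that the factor is of order -log η for small η.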




An Improved VC Dimension Bound for Sparse Polynomials

Michael Schmitt

Lehrstuhl Mathematik und Informatik, Fakultät für Mathematik
Ruhr-Universität Bochum, D-44780 Bochum, Germany
http://www.ruhr-uni-bochum.de/lmi/mschmitt/
mschmitt@lmi.ruhr-uni-bochum.de

Abstract. We show that the function class consisting of k-sparse polynomials in n variables has Vapnik-Chervonenkis (VC) dimension at least nk + 1. This result supersedes the previously known lower bound via k-term monotone disjunctive normal form (DNF) formulas obtained by Littlestone (1988). Moreover, it implies that the VC dimension for k-sparse polynomials is strictly larger than the VC dimension for k-term monotone DNF. The new bound is achieved by introducing an exponential approach that employs Gaussian radial basis function (RBF) neural networks for obtaining classifications of points in terms of sparse polynomials.



1 Introduction 

A multivariate polynomial is said to be k-sparse if it consists of at most k monomials. Sparseness is a prerequisite that has proven to be instrumental in numerous results concerning the computational aspects of polynomials. Sparse polynomials have been extensively investigated not only in the context of learning algorithms (see, e.g., Blum and Singh, 1990; Bshouty and Mansour, 1995; Fischer and Simon, 1992; Schapire and Sellie, 1996), but also with regard to interpolation and approximation tasks (see, e.g., Grigoriev et al., 1990; Huang and Rao, 1999; Murao and Fujise, 1996; Roth and Benedek, 1990).

The Vapnik-Chervonenkis (VC) dimension of a function class quantifies its classification capabilities (Vapnik and Chervonenkis, 1971): It indicates the cardinality of the largest set for which all possible binary-valued classifications are obtained using functions from the class. The VC dimension is well established as a measure for the complexity of learning (see, e.g., Anthony and Bartlett, 1999): It yields bounds for the generalization error of learning algorithms via uniform convergence results.

We establish here a new lower bound on the VC dimension of sparse multivariate polynomials: We show that the class of k-sparse polynomials in n variables has VC dimension at least nk + 1. The previously best known lower bound is derived from the lower bound for Boolean formulas in k-term monotone disjunctive normal form (DNF), that is, disjunctions of at most k monomials without negations. This bound has been obtained by Littlestone (1988). In particular,



J. Shawe-Taylor and Y. Singer (Eds.): COLT 2004, LNAI 3120, pp. 393—407, 2004. 
Littlestone has shown that the class of k-term monotone l-DNF formulas (i.e., with monomials of size at most l) has VC dimension at least $lk\lfloor \log(n/m) \rfloor$, where $l \le m \le n$ and $k \le \binom{m}{l}$. Using, for instance, $l = n/4$ and $m = n/2$, this yields the lower bound $nk/4$ for the VC dimension of k-term monotone DNF and, hence, of k-sparse polynomials, where k has to satisfy the given constraints.
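The arithmetic behind the two lower bounds can be made concrete with a short sketch. It evaluates the bound l·k·⌊log(n/m)⌋ with the logarithm taken to base 2 (an assumption, chosen because it reproduces the value nk/4 for l = n/4 and m = n/2) and compares it with nk + 1; the function names and the values n = 16, k = 10 are illustrative.

```python
from math import comb, floor, log2

def littlestone_bound(n, k, l, m):
    """l * k * floor(log2(n / m)), stated for l <= m <= n and k <= C(m, l)."""
    assert l <= m <= n and k <= comb(m, l)
    return l * k * floor(log2(n / m))

def sparse_polynomial_bound(n, k):
    """The lower bound nk + 1 for k-sparse polynomials in n variables."""
    return n * k + 1

n, k = 16, 10
old = littlestone_bound(n, k, l=n // 4, m=n // 2)   # 4 * 10 * 1
new = sparse_polynomial_bound(n, k)                  # 16 * 10 + 1
```

For these values the earlier bound equals nk/4 = 40, while the bound nk + 1 gives 161 and carries no constraint linking k to n.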

The new bound that we provide here for sparse polynomials supersedes this 
previous bound in a threefold way: 

1. It improves the bound from k-term monotone DNF in value.

2. It releases k from the constraints through n in that the bound holds for every n and k, in particular for values of k that are larger than the number of monotone monomials.

3. The value nk + 1 is even larger than the VC dimension of the class of k-term monotone DNF formulas itself: We show that the difference between both dimensions is larger than $k \log(k/e) + 1$.

So far, a considerable number of results and techniques for VC dimension bounds have been provided in the context of real-valued function classes (see, e.g., Bartlett and Maass, 2003, and the references there). For specific subclasses of sparse polynomials, tight bounds have been calculated: Karpinski and Werther (1993) have shown that k-sparse univariate polynomials have a VC dimension¹ proportional to k. Further, the VC dimension of the class of monomials over the reals is equal to n (see Ehrenfeucht et al., 1989, for the lower bound and Schmitt, 2002c, for the upper bound). There is also a VC dimension result known for n-variate d-degree polynomials (see, e.g., Ben-David and Lindenbaum, 1998): This class has VC dimension equal to $\binom{n+d}{d}$. However, as the class contains polynomials that are $\binom{n+d}{d}$-sparse and k-sparseness imposes restrictions on the number of variables in terms of k, this result entails for sparse multivariate polynomials (without constraint on the degree) a lower bound not better than the bound due to Littlestone (1988).

There has been previous work that established techniques for deriving lower bounds for quite general types of real-valued function classes. Building on results by Lee et al. (1995), Erlich et al. (1997) provide powerful means for obtaining lower bounds for parameterized function classes². An essential requirement for using these techniques, however, is that the function class is "smoothly" parameterized, a fact that does not apply to the exponents of polynomials. The lower bound method of Koiran and Sontag (1997) for various types of neural networks, generalized by Bartlett et al. (1998) to neural networks with a given number of layers, cannot be employed for polynomials either. This technique is constrained to networks where each neuron computes a function with finite limits at infinity, a property monomials do not have. Further, Koiran and Sontag (1997) designed a lower bound method for networks consisting of linear and multiplication gates. However, the way these networks are constructed (with layers consisting of products of linear terms³) does not give rise to sparse polynomials, even when the number of layers is restricted.

¹ Precisely, Karpinski and Werther (1993) studied a related notion, the so-called pseudo-dimension. Following their methods, it is not hard to obtain this result for the VC dimension (see also Schmitt, 2002a).

² A parameterized function class is given in terms of a function having two types of variables: input variables and parameter variables. The functions of the class are obtained by instantiating the parameter variables with, in general, real numbers. Neural networks are prominent examples for parameterized function classes.

We provide a completely new approach to the derivation of lower bounds on the VC dimension of sparse multivariate polynomials. First, we establish the lower bound nk + 1 on the VC dimension of a specific type of radial basis function (RBF) neural network (see, e.g., Haykin, 1999). The networks considered here have k Gaussian units as computational elements and satisfy certain assumptions with respect to the input domain and the values taken by the parameters. The bound for these networks improves a result of Erlich et al. (1997) in combination with Lee et al. (1995) who established the lower bound n(k - 1) for RBF networks⁴ with restrictions neither on inputs nor on parameters. Then we use our result for RBF networks to obtain the lower bound on the VC dimension of sparse multivariate polynomials. Thus, RBF networks open a new way to assess the classification capabilities of sparse multivariate polynomials. This Gaussian approach has also proven to be helpful in a different context dealing with the roots of univariate polynomials (Schmitt, 2004).

Sparse multivariate polynomials are a special case of a particular type of neural networks, the so-called product unit neural networks (Durbin and Rumelhart, 1989). It immediately follows from the bound for sparse multivariate polynomials established here that the VC dimension of product unit neural networks with n input nodes and one layer of k hidden nodes (that is, nodes that are neither input nor output nodes) is at least nk + 1.

Concerning known upper bounds for the VC dimension of sparse multivariate polynomials, there are two relevant results: First, the bound $O(n^2 k^4)$ due to Karpinski and Macintyre (1997) is the smallest upper bound known for polynomials with unlimited degree (see also Schmitt, 2002c). Second, the class of k-sparse n-variate polynomials with degree at most d has VC dimension no more than $2nk \log(9d)$ (Schmitt, 2002c). The derivation of the new lower bound not only narrows the gap between upper and lower bounds, but also gives rise to subclasses of degree-restricted polynomials for which the bound is optimal up to the factor $2 \log(9d)$.

We introduce definitions and notation in Section 2. Section 3 provides geometric constructions that are required for the derivations of the main results presented in Section 4. Finally, in Section 5, we show that the new bound exceeds the VC dimension of k-term monotone DNF.

³ Such a layer uses products of the form […] where it is crucial that there is no bound on l.

⁴ These results and the one presented here concern RBF networks with uniform width. (See the definition in Section 2.) Better lower bounds are known for more general types of RBF networks (Schmitt, 2002b).
2 Definitions

The class of k-sparse polynomials in n variables consists of the functions

$$a_0 + a_1 x_1^{b_{1,1}} \cdots x_n^{b_{1,n}} + \cdots + a_k x_1^{b_{k,1}} \cdots x_n^{b_{k,n}}$$

with real coefficients $a_0, \ldots, a_k$ and nonnegative integer exponents $b_{1,1}, \ldots, b_{k,n}$. Note that, in contrast to some other work, the notion of k-sparseness does not include the constant term $a_0$ in the value of k. In the derivation of the bound we associate the non-constant monomials with certain computing units of a neural network. Thus, the degree of sparseness of a polynomial coincides with the number of so-called hidden units of a neural network.
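The definition can be mirrored directly in code. The sketch below builds a k-sparse polynomial from a constant term, a list of k coefficients and a list of k exponent vectors, with the constant term not counted in k; the function name `sparse_poly` and the sample polynomial are illustrative.

```python
def sparse_poly(a0, coeffs, exponents):
    """Return x -> a0 + sum_i coeffs[i] * prod_j x[j] ** exponents[i][j].

    len(coeffs) is the sparseness k; the constant term a0 is not counted.
    """
    def f(x):
        total = a0
        for a, b in zip(coeffs, exponents):
            term = a
            for xj, bj in zip(x, b):
                term *= xj ** bj
            total += term
        return total
    return f

# A 2-sparse polynomial in 3 variables: 1 + 2 * x1^2 * x3 - x2^3
p = sparse_poly(1.0, [2.0, -1.0], [(2, 0, 1), (0, 3, 0)])
value = p((1.0, 2.0, 3.0))   # 1 + 2*1*3 - 8
```

Each exponent vector corresponds to one hidden unit in the network view used later in the section.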

If the exponents are allowed to be arbitrary real numbers, we obtain the class of functions computed by a product unit neural network with k product units. In these networks, a product unit computes the term

$$x_1^{b_{i,1}} \cdots x_n^{b_{i,n}}$$

and the coefficients $a_0, \ldots, a_k$ are considered as the output weights of the network with bias $a_0$.

We use $\|\cdot\|$ to denote the Euclidean norm. A radial basis function neural network (RBF network, for short) computes functions that can be written as

$$w_0 + w_1 \exp\left( -\frac{\|x - c_1\|^2}{\sigma^2} \right) + \cdots + w_k \exp\left( -\frac{\|x - c_k\|^2}{\sigma^2} \right),$$

where k is the number of RBF units. This particular type of network is also known as Gaussian RBF network. Each exponential term corresponds to the function computed by a Gaussian RBF unit with center $c_i \in \mathbb{R}^n$, where n is the number of variables, and width $\sigma \in \mathbb{R} \setminus \{0\}$. The width is a network parameter that we assume to be equal for all units, that is, we consider RBF networks with uniform width. Further, $w_0, \ldots, w_k$ are the output weights and $w_0$ is also referred to as the bias of the network.
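A Gaussian RBF network with uniform width can be evaluated with a few lines of code. The sketch assumes that each unit computes exp(-‖x - c‖² / σ²); the exact placement of σ in the exponent is an assumption about the garbled formula, and the function name `rbf_network` is illustrative.

```python
from math import exp

def rbf_network(w0, weights, centers, sigma):
    """Return x -> w0 + sum_i weights[i] * exp(-||x - centers[i]||^2 / sigma^2).

    All units share the same width sigma (uniform width); sigma must be nonzero.
    """
    def f(x):
        total = w0
        for w, c in zip(weights, centers):
            sq_dist = sum((xj - cj) ** 2 for xj, cj in zip(x, c))
            total += w * exp(-sq_dist / sigma ** 2)
        return total
    return f

# One Gaussian unit centred at the origin, bias -0.5: positive near the
# centre, negative far away, so thresholding at zero classifies a ball.
net = rbf_network(-0.5, [1.0], [(0.0, 0.0)], sigma=1.0)
at_center = net((0.0, 0.0))
far_away = net((3.0, 0.0))
```

With a single unit and a negative bias, the set where the output is nonnegative is a ball around the center, which is the geometric picture used in the constructions of Section 3.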

The Vapnik-Chervonenkis (VC) dimension of a class $\mathcal{F}$ of real-valued functions is defined via the notion of shattering: A set $S \subseteq \mathbb{R}^n$ is said to be shattered by $\mathcal{F}$ if every dichotomy of $S$ is induced by $\mathcal{F}$, that is, if for every pair $(S^-, S^+)$, where $S^- \cap S^+ = \emptyset$ and $S^- \cup S^+ = S$, there is some function $f \in \mathcal{F}$ such that

$$\mathrm{sgn} \circ f(S^-) \subseteq \{0\} \quad \text{and} \quad \mathrm{sgn} \circ f(S^+) \subseteq \{1\}.$$

Here $\mathrm{sgn} : \mathbb{R} \to \{0, 1\}$ denotes the sign function, satisfying $\mathrm{sgn}(x) = 1$ if $x \ge 0$, and $\mathrm{sgn}(x) = 0$ otherwise. The VC dimension of $\mathcal{F}$ is then defined as the cardinality of the largest set shattered by $\mathcal{F}$. (It is said to be infinite if there is no such set.)
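For a finite family of functions, shattering can be tested by brute force. The sketch below assumes the convention sgn(x) = 1 for x ≥ 0 and 0 otherwise, and uses a small family of one-dimensional threshold functions as a stand-in for a function class; the names `is_shattered` and `thresholds` are illustrative, and a finite family can only certify shattering, not rule it out for the full infinite class.

```python
from itertools import product

def sgn(v):
    """Sign function with values in {0, 1}: 1 if v >= 0, else 0."""
    return 1 if v >= 0 else 0

def is_shattered(points, functions):
    """True if every 0/1 labelling of `points` equals sgn(f) on the points for some f."""
    realised = {tuple(sgn(f(x)) for x in points) for f in functions}
    return all(lab in realised for lab in product((0, 1), repeat=len(points)))

# Affine functions s * (x - t) on the real line, for a few thresholds t and signs s.
thresholds = [lambda x, s=s, t=t: s * (x - t)
              for s in (1, -1) for t in (-1.0, 0.5, 1.5, 3.0)]

two_points = is_shattered([0.0, 1.0], thresholds)
three_points = is_shattered([0.0, 1.0, 2.0], thresholds)
```

The two-point set is shattered by this family, while the three-point set is not, since no function of the form s·(x - t) labels the outer points 1 and the middle point 0.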

Finally, we make use of the geometric notions of ball and hypersphere. A ball in $\mathbb{R}^n$ is given in terms of a center $c \in \mathbb{R}^n$ and a radius $\rho \in \mathbb{R}$ as the set

$$B(c, \rho) = \{x \in \mathbb{R}^n : \|x - c\| \le \rho\}.$$

A hypersphere is the set of points on the surface of a ball, that is, the set

$$S(c, \rho) = \{x \in \mathbb{R}^n : \|x - c\| = \rho\}.$$
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3 Geometric Constructions 

In the following we provide the geometric constructions that are the basis for
the main result in Section 4. The idea is to represent classifications of sets using
unions of balls, where a point is classified as positive if and only if it is contained
in some ball. In order to be shattered, the sets are chosen to satisfy a certain
condition of independence with respect to the positions of their elements:
The points are required to lie on hyperspheres such that each hypersphere is
maximally determined by the set of points. In other words, removing any point
increases the set of possible hyperspheres that contain the reduced set. The
following definition makes this notion of independence precise.

Definition. A set $Q \subseteq \mathbb{R}^n$ of at most $n + 1$ points is in general position for
hyperspheres if the system of equalities

$$\|p - c\| = \eta, \quad \text{for all } p \in Q, \qquad (1)$$

in the variables $c = (c_1, \ldots, c_n)$ and $\eta$ has a solution and, for every $q \in Q$, the
solution set is a proper subset of the solution set of the system

$$\|p - c\| = \eta, \quad \text{for all } p \in Q \setminus \{q\}. \qquad (2)$$

Given a set of points that satisfies this definition and lies on a hypersphere, 
we next want to find a ball such that one of the points lies outside of the ball 
while the other points are on its surface. We show that this can be done, provided 
that the set is in general position for hyperspheres. Moreover, the ball can be 
chosen with the center and radius as close as possible to the center and radius 
of the hypersphere that contains all points. 

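Lemma 1. Suppose that $Q \subseteq \mathbb{R}^n$ is a set of at most $n + 1$ points in general
position for hyperspheres and let $q \in Q$. Further, let $c \in \mathbb{R}^n$, $\eta \in \mathbb{R}$ be a solution
of the system

$$\|p - c\| = \eta, \quad \text{for all } p \in Q. \qquad (3)$$

Then, for every $\varepsilon > 0$, there exists a solution $c(\varepsilon) \in \mathbb{R}^n$, $\eta(\varepsilon) \in \mathbb{R}$ of the system

$$\|p - c(\varepsilon)\| = \eta(\varepsilon), \quad \text{for all } p \in Q \setminus \{q\},$$

$$\|q - c(\varepsilon)\| > \eta(\varepsilon)$$

satisfying

$$\|c - c(\varepsilon)\| < \varepsilon,$$

$$|\eta - \eta(\varepsilon)| < \varepsilon.$$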
Proof. Without loss of generality, we may assume that $\eta > 0$. (If $\eta = 0$ then we
have $|Q| = 1$, and the statement is trivial.) Since $c$ and $\eta$ solve the system (3),
$c$ and $d = \eta^2 - \|c\|^2$ are a solution of the system

$$\|p\|^2 - 2\,p \cdot c = d, \quad \text{for all } p \in Q. \qquad (4)$$

Because $Q$ is in general position for hyperspheres, the solution set of the system
(4) is a proper subset of the solution set of the system

$$\|p\|^2 - 2\,p \cdot c = d, \quad \text{for all } p \in Q \setminus \{q\}. \qquad (5)$$

According to facts from linear algebra, there exist $a \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$ such that
for every $\lambda \ne 0$, we have with $c + \lambda a$ and $d + \lambda\alpha$ a solution of the system (5)
that does not solve the system (4). For a given $\varepsilon > 0$, choose $\lambda(\varepsilon) \in \mathbb{R} \setminus \{0\}$
such that $|\lambda(\varepsilon)|$ is sufficiently small to satisfy the two inequalities

$$\|\lambda(\varepsilon) a\| < \varepsilon, \qquad (6)$$

$$\left| \sqrt{d + \|c\|^2} - \sqrt{d + \lambda(\varepsilon)\alpha + \|c + \lambda(\varepsilon) a\|^2} \right| < \varepsilon. \qquad (7)$$

It is obvious that the second inequality can be met due to the fact that the
equation $\sqrt{d + \|c\|^2} = \eta$ holds, which we get from the definition of $d$, and the
assumption $\eta > 0$. Since $c + \lambda(\varepsilon) a$ and $d + \lambda(\varepsilon)\alpha$ solve (5) but not (4), it follows
that

$$\|q\|^2 - 2\,q \cdot (c + \lambda(\varepsilon) a) \ne d + \lambda(\varepsilon)\alpha,$$

which, using $\|q\|^2 - 2\,q \cdot c = d$ from (4), is equivalent to

$$-2\lambda(\varepsilon)\, q \cdot a \ne \lambda(\varepsilon)\alpha.$$

Due to this inequality, we can choose the (not yet specified) sign of $\lambda(\varepsilon)$ such
that

$$-2\lambda(\varepsilon)\, q \cdot a > \lambda(\varepsilon)\alpha.$$

Again with $\|q\|^2 - 2\,q \cdot c = d$, it follows that

$$\|q\|^2 - 2\,q \cdot (c + \lambda(\varepsilon) a) > d + \lambda(\varepsilon)\alpha,$$

and, therefore,

$$\|q - (c + \lambda(\varepsilon) a)\|^2 > d + \lambda(\varepsilon)\alpha + \|c + \lambda(\varepsilon) a\|^2.$$

Hence, defining

$$c(\varepsilon) = c + \lambda(\varepsilon) a,$$

$$\eta(\varepsilon) = \sqrt{d + \lambda(\varepsilon)\alpha + \|c + \lambda(\varepsilon) a\|^2},$$

we obtain $\|q - c(\varepsilon)\| > \eta(\varepsilon)$. Furthermore, the inequalities (6) and (7) imply
that the relations

$$\|c - c(\varepsilon)\| < \varepsilon,$$

$$|\eta - \eta(\varepsilon)| < \varepsilon$$

hold as claimed. □

We now apply the previous result to show that any dichotomy of a given 
set of points can be obtained using balls. As the set may generally be a subset 
of some larger set, we also ensure that the balls do not enclose any additional 
point. Further, we guarantee that this can be done with all centers remaining 
positive, a condition that will turn out to be useful in the following section. We 
say here that a vector is positive, if all its components are larger than zero. 

Lemma 2. Let $Q \subseteq \mathbb{R}^n$ be a set of $n$ points in general position for hyperspheres
and let $P \subseteq \mathbb{R}^n$ be a finite set with $Q \subseteq P$. Assume further that there exists a
positive center $c \in \mathbb{R}^n$ and a radius $\eta \in \mathbb{R}$ such that

$$Q \subseteq S(c, \eta),$$

$$P \cap B(c, \eta) = Q.$$

Then for every $R \subseteq Q$ there exists a positive center $d \in \mathbb{R}^n$ and a radius $\zeta \in \mathbb{R}$
such that

$$R \subseteq S(d, \zeta),$$

$$P \cap B(d, \zeta) = R.$$

Proof. Clearly, it is sufficient to consider sets $R$ that are proper subsets of $Q$.
Without loss of generality, we may assume that $|R| = |Q| - 1$. The general case
then follows inductively. Suppose that $q \in Q$ and let $R = Q \setminus \{q\}$. According to
Lemma 1, for every $\varepsilon > 0$ there exist $c(\varepsilon), \eta(\varepsilon)$ satisfying

$$\|p - c(\varepsilon)\| = \eta(\varepsilon), \quad \text{for all } p \in Q \setminus \{q\}, \qquad (8)$$

$$\|q - c(\varepsilon)\| > \eta(\varepsilon), \qquad (9)$$

$$\|c - c(\varepsilon)\| < \varepsilon, \qquad (10)$$

$$|\eta - \eta(\varepsilon)| < \varepsilon. \qquad (11)$$

Obviously, property (8) implies that $R \subseteq S(c(\varepsilon), \eta(\varepsilon))$. Property (9) states that
$q \notin B(c(\varepsilon), \eta(\varepsilon))$. Since the assumption $P \cap B(c, \eta) = Q$ implies that for every
$p' \in P \setminus Q$ the constraint

$$\|p' - c\| > \eta$$

holds, properties (10) and (11) entail the condition

$$\|p' - c(\varepsilon)\| > \eta(\varepsilon)$$

for all sufficiently small $\varepsilon$. Thus, for any such $\varepsilon$ we get the assertion

$$P \cap B(c(\varepsilon), \eta(\varepsilon)) = R.$$

Further, as $c$ is positive, property (10) ensures that $c(\varepsilon)$ is positive for some
sufficiently small $\varepsilon$. Hence, the claim follows for $d = c(\varepsilon)$, $\zeta = \eta(\varepsilon)$. □



4 VC Dimension Bound for Sparse Multivariate 
Polynomials 

Before getting to the main result, we derive the lower bound nk + 1 for the VC 
dimension of a restricted type of RBF network. For more general RBF networks, 
results of Erlich et al. (1997) and Lee et al. (1995) yield $n(k - 1)$ as lower bound.
The following theorem is stronger not only in the value of the bound, but also 
in the assumptions that hold: The points of the shattered set all have the same 
distance from the origin, the centers of the RBF units are rational numbers, and 
the width can be chosen arbitrarily small. 

Theorem 3. Let $n \ge 2$, $k \ge 1$, and $\rho > 0$ be given. There exists a set
$P \subseteq S(0, \rho) \subseteq \mathbb{R}^n$ of $nk + 1$ points and a real number $\sigma_0 > 0$ so that $P$ is shattered
by the class of functions computed by the RBF network with $k$ hidden units,
positive rational centers, and any width $0 < \sigma < \sigma_0$.



Proof. Suppose that $B(c_1, \eta_1), \ldots, B(c_k, \eta_k)$ are pairwise disjoint balls with
positive centers $c_1, \ldots, c_k \in \mathbb{R}^n$ such that, for $i = 1, \ldots, k$, the intersection
$S(c_i, \eta_i) \cap S(0, \rho)$ is non-empty and not a single point. (An example for $n = 2$
and $k = 3$ is shown in Fig. 1.) For $i = 1, \ldots, k$, let $P_i \subseteq S(c_i, \eta_i) \cap S(0, \rho)$ be a
set of $n$ points in general position for hyperspheres. (Note that $P_i$ is constrained
to lie on two different hyperspheres. This still allows $P_i$ to be chosen in general
position since $P_i$ contains $n$ (and not $n + 1$) points, so that the set of possible
centers for $P_i$ yields a line.) Further, let $s \in S(0, \rho)$ be some point such that
$s \notin B(c_i, \eta_i)$, for $i = 1, \ldots, k$. We claim that the set $P = \{s\} \cup P_1 \cup \cdots \cup P_k$,
which has $nk + 1$ points, is shattered by the RBF network with the postulated
restrictions on the parameters.

Assume that $(P^-, P^+)$ is some arbitrary dichotomy of $P$ where $s \in P^-$.
(We will argue at the end of the proof that the complementary case can be
treated by reversing signs.) Let $(P_i^-, P_i^+)$ denote the dichotomy induced on $P_i$.
By construction, every $P_i$ satisfies

$$P_i \subseteq S(c_i, \eta_i) \quad \text{and} \quad P \cap B(c_i, \eta_i) = P_i.$$
Fig. 1. The points of the shattered set are chosen from the intersections of the
hypersphere $S(0, \rho)$ with the surfaces of pairwise disjoint balls $B(c_i, \eta_i)$. All balls have their
centers in the positive orthant. There is one additional point $s$ not contained in any of
the balls



Hence by Lemma 2, instantiating the set $Q$ with $P_i$ and the set $R$ with $P_i^+$, it
follows that there exist positive centers $d_i$ and radii $\zeta_i$ such that

$$P_i^+ \subseteq S(d_i, \zeta_i) \quad \text{and} \quad P \cap B(d_i, \zeta_i) = P_i^+,$$

for $i = 1, \ldots, k$. Moreover, the centers $d_i$ can be replaced by rational centers $\tilde d_i$
that are sufficiently close to $d_i$, such that every point of $P$ lying outside the ball
$B(d_i, \zeta_i)$ is outside the ball $B(\tilde d_i, \tilde\zeta_i)$ for some $\tilde\zeta_i \in \mathbb{R}$ close to $\zeta_i$, and every point
of $P$ lying on the hypersphere $S(d_i, \zeta_i)$ is contained in the ball $B(\tilde d_i, \tilde\zeta_i)$. Thus,
every $p \in P$ satisfies

$$p \in B(\tilde d_i, \tilde\zeta_i) \quad \text{if and only if} \quad p \in P_i^+, \qquad (12)$$

for $i = 1, \ldots, k$. Clearly, since the centers $d_i$ are positive, the rational centers $\tilde d_i$
can be chosen to be positive as well.

The parameters of the RBF network are specified as follows: The $i$-th unit
is associated with the ball $B(\tilde d_i, \tilde\zeta_i)$. Assigned to it is $\tilde d_i$ as the center and as
output weight the value $\exp(\tilde\zeta_i^2/\sigma^2)$ (where $\sigma$ will be determined below) so that
the unit contributes the term

$$\exp\left(-\frac{\|x - \tilde d_i\|^2 - \tilde\zeta_i^2}{\sigma^2}\right)$$

to the computation of the network. From assertion (12) we obtain that every
$p \in P \setminus P_i^+$ satisfies the constraint

$$\|p - \tilde d_i\| > \tilde\zeta_i.$$

Thus, for every sufficiently small $\sigma > 0$ and every $p \in P \setminus P_i^+$, we achieve that

$$\exp\left(-\frac{\|p - \tilde d_i\|^2 - \tilde\zeta_i^2}{\sigma^2}\right) < \frac{1}{k} \qquad (13)$$

is valid for $i = 1, \ldots, k$. On the other hand, for every $p \in P_i^+$ condition (12)
implies

$$\|p - \tilde d_i\| \le \tilde\zeta_i,$$

which entails

$$\exp\left(-\frac{\|p - \tilde d_i\|^2 - \tilde\zeta_i^2}{\sigma^2}\right) \ge 1 \qquad (14)$$

for every $\sigma > 0$. Finally, we set the bias term equal to $-1$. It is now easy to see
that the dichotomy $(P^-, P^+)$ is induced by the parameter settings: If $p \in P^-$
then, according to inequality (13), the weighted output values of the units and
the bias sum up to a negative value. In the case $p \in P^+$ we have $p \in P_i^+$ for
some $i$ and, by inequality (14), the weighted unit $i$ outputs a value of at least 1,
while the other units output positive values, so that the total network output is
positive.

The construction for the case that classifies $s$ as positive works similarly.
We invoke Lemma 2 substituting $P_i^-$ for $R$ and derive the analogous version of
assertion (12) with $P_i^+$ replaced by $P_i^-$. Then it is obvious that, if the weights
defined above are equipped with negative signs and 1 is used as the bias, the
network induces the dichotomy as claimed.

We observe that $\sigma$ may have been chosen such that it depends on the particular
dichotomy. To complete the proof, we require $\sigma_0$ to be small enough so
that inequality (13) holds for $\sigma < \sigma_0$ on all points and dichotomies of $P$. □
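The parameter choice in the proof (output weight $\exp(\zeta^2/\sigma^2)$ per ball, bias $-1$) can be illustrated numerically. The following is a minimal sketch, not the paper's construction in full: the function name `ball_network` and the example balls are assumptions, and weight and Gaussian are folded into a single exponent to avoid overflow for small widths.

```python
import math


def ball_network(x, balls, sigma):
    """Network output -1 + sum_i exp((zeta_i^2 - ||x - d_i||^2) / sigma^2).

    `balls` is a list of (center, radius) pairs. Each summand is the
    product of the output weight exp(zeta_i^2 / sigma^2) and the Gaussian
    unit centered at d_i, written as one exponential.
    """
    total = -1.0  # bias term
    for center, radius in balls:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
        total += math.exp((radius ** 2 - sq_dist) / sigma ** 2)
    return total
```

With two disjoint balls of radius 0.5 around (1, 1) and (3, 1) and width 0.3, a point inside either ball yields a positive output, while the point (2, 1) outside both yields a negative one, since each unit then contributes far less than 1/2.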

We remark that one assumption of the theorem can be slightly weakened: It
is not necessary to require that $s \in S(0, \rho)$. Instead, every point not contained
in any of the balls $B(c_i, \eta_i)$ can be selected for $s$. However, the restriction is
required for the application of the theorem in the following result, which is the
main contribution of this paper. For its proof we recall the definition of a product
unit neural network in Section 2.

Theorem 4. For every $n, k \ge 1$, the VC dimension of the class of $k$-sparse
polynomials in $n$ variables is at least $nk + 1$.



Proof. We first consider the case $n \ge 2$. By Theorem 3, for $\rho > 0$ let $P \subseteq \mathbb{R}^n$,
$P \subseteq S(0, \rho)$, be the set of cardinality $nk + 1$ that is shattered by the RBF
network with $k$ hidden units and the stated parameter settings. We show that
$P$ can be transformed into a set $P'$ that is shattered by $k$-sparse polynomials.
The weighted output computed by unit $i$ in the RBF network on input $p \in P$
can be written as

$$w_i \exp\left(-\frac{\|p - c_i\|^2}{\sigma^2}\right) = w_i \exp\left(-\frac{\|p\|^2 - 2\,p \cdot c_i + \|c_i\|^2}{\sigma^2}\right)$$

$$= w_i \exp\left(-\frac{\|p\|^2}{\sigma^2}\right) \exp\left(-\frac{\|c_i\|^2}{\sigma^2}\right) \exp\left(\frac{2\,p \cdot c_i}{\sigma^2}\right)$$

$$= w_i \exp\left(-\frac{\rho^2}{\sigma^2}\right) \exp\left(-\frac{\|c_i\|^2}{\sigma^2}\right) \exp\left(\frac{2 p_1 c_{i,1}}{\sigma^2}\right) \cdots \exp\left(\frac{2 p_n c_{i,n}}{\sigma^2}\right),$$

where we have used the assumption $P \subseteq S(0, \rho)$ for the last equation, and
$p_j, c_{i,j}$ to denote the $j$-th components of the vectors $p, c_i$, respectively. Consider
a product unit network with one hidden layer, where unit $i$ has output weight

$$\hat w_i = w_i \exp\left(-\frac{\rho^2 + \|c_i\|^2}{\sigma^2}\right)$$

and exponents $2 c_{i,j}/\sigma^2$ for $j = 1, \ldots, n$. On the set

$$P' = \{(e^{p_1}, \ldots, e^{p_n}) : (p_1, \ldots, p_n) \in P\}$$

this product unit network computes the same values as the RBF network on $P$.
Moreover, the exponents of the product units are positive rationals. According
to Theorem 3, for some $\sigma_0$, any width $0 < \sigma < \sigma_0$ can be used. Therefore, we
may choose $\sigma^2 = 1/l$ for some natural number $l$ that is sufficiently large and a
common multiple of all denominators occurring in any $c_{i,j}$, so that the exponents
become integers. With these parameter settings, we have a $k$-sparse polynomial
that computes on $P'$ the same output values as the RBF network on $P$. As this
can be done for every dichotomy of $P$, it follows that $P'$ is shattered by $k$-sparse
polynomials.

For the case $n = 1$, we again use the RBF technique and ideas from Schmitt
(2002a, 2004). Clearly, the set $M = \{0, \ldots, k\}$ can be shattered by an RBF
network with $k + 1$ hidden units and zero bias: For each $i \in M$ we employ an
RBF unit with center $i$; given a dichotomy $(M^-, M^+)$, we let the output weight
for unit $i$ be $-1$ if $i \in M^-$, and 1 if $i \in M^+$. If the width $\sigma$ is small enough, the
output value of the network has the requested sign on every input $i \in M$. Now,
let $\sigma$ be the smallest width sufficient for all dichotomies of $M$. Then

$$w_0 \exp\left(-\frac{x^2}{\sigma^2}\right) + w_1 \exp\left(-\frac{(x - 1)^2}{\sigma^2}\right) + \cdots + w_k \exp\left(-\frac{(x - k)^2}{\sigma^2}\right) > 0$$

is, by multiplication with $\exp(x^2/\sigma^2)$, equivalent to

$$w_0 + w_1 \exp\left(\frac{2x - 1}{\sigma^2}\right) + \cdots + w_k \exp\left(\frac{2kx - k^2}{\sigma^2}\right) > 0.$$

The latter can be written as

$$w_0 + w_1 \exp\left(-\frac{1}{\sigma^2}\right) \exp\left(\frac{2x}{\sigma^2}\right) + \cdots + w_k \exp\left(-\frac{k^2}{\sigma^2}\right) \exp\left(\frac{2kx}{\sigma^2}\right) > 0.$$

Substituting $y = \exp(2x/\sigma^2)$, this holds if and only if

$$w_0 + w_1 \exp\left(-\frac{1}{\sigma^2}\right) y + \cdots + w_k \exp\left(-\frac{k^2}{\sigma^2}\right) y^k > 0.$$

Thus, for every dichotomy of $M$ we obtain a dichotomy of $M' = \{\exp(2i/\sigma^2) : i =
0, \ldots, k\}$ induced by a $k$-sparse polynomial. In other words, $M'$ is shattered by
this function class. □
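The univariate argument can be checked numerically for small $k$. The sketch below assumes a concrete value of $\sigma^2$ passed by the caller (the proof only requires it to be small enough); the function names are illustrative. It verifies, for every $\pm 1$ weight vector, that both the RBF network on $M$ and the substituted polynomial on $M'$ give the sign of the corresponding weight.

```python
import math
from itertools import product


def rbf_value(x, weights, sigma2):
    """Zero-bias RBF network with unit centers 0, 1, ..., k and width^2 sigma2."""
    return sum(w * math.exp(-(x - i) ** 2 / sigma2) for i, w in enumerate(weights))


def poly_value(y, weights, sigma2):
    """Polynomial sum_i w_i * exp(-i^2 / sigma2) * y^i from the substitution."""
    return sum(w * math.exp(-i ** 2 / sigma2) * y ** i for i, w in enumerate(weights))


def check_univariate_case(k, sigma2):
    """True if all dichotomies of {0..k} are induced by both representations."""
    for weights in product((-1.0, 1.0), repeat=k + 1):
        for x in range(k + 1):
            want_positive = weights[x] > 0
            y = math.exp(2 * x / sigma2)  # transformed input in M'
            if (rbf_value(x, weights, sigma2) > 0) != want_positive:
                return False
            if (poly_value(y, weights, sigma2) > 0) != want_positive:
                return False
    return True
```

For $k = 3$ the value $\sigma^2 = 0.5$ is small enough, whereas $\sigma^2 = 10$ is not: the neighbouring units then outweigh the unit at the queried point for some dichotomies.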



5 Comparison with $k$-Term Monotone DNF

A Boolean formula that is a disjunction of up to $k$ monomial terms without negations
can be considered as a $k$-sparse polynomial restricted to Boolean inputs.
The previously best known lower bound for the VC dimension of $k$-sparse polynomials
was the bound for $k$-term monotone DNF due to Littlestone (1988). By
deriving an upper bound for the latter class and applying Theorem 4, we show
that the VC dimension for $k$-sparse polynomials is strictly larger than the VC
dimension for $k$-term monotone DNF. We use "log" to denote the logarithm of
base 2.

Corollary 5. Let $n \ge 1$ and $3 \le k \le 2^n$. The VC dimension of the class of
$k$-sparse $n$-variate polynomials exceeds the VC dimension of the class of $k$-term
$n$-variate monotone DNF by more than $k \log(k/e) + 1$.

Proof. A $k$-term monotone DNF formula corresponds to a collection of up to $k$
subsets of the set of variables. For $n$ variables, there are no more than $\sum_{i=0}^{k} \binom{2^n}{i}$
such collections. The known inequality $\sum_{i=0}^{d} \binom{m}{i} < (em/d)^d$, where $1 \le d \le m$,
(see, e.g., Anthony and Bartlett, 1999, Theorem 3.7) yields

$$\sum_{i=0}^{k} \binom{2^n}{i} < \left(\frac{e 2^n}{k}\right)^k = 2^{nk - k \log(k/e)}.$$

By definition, the VC dimension of a finite function class $\mathcal{F}$ cannot be larger than
$\log |\mathcal{F}|$. Hence, the VC dimension for $k$-term monotone DNF is less than $nk -
k \log(k/e)$. Theorem 4 implies that this bound falls short of the VC dimension
for $k$-sparse polynomials by at least $k \log(k/e) + 1$. □
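The counting argument can be checked for concrete values. This is a small sketch under the assumption that the number of collections is bounded by the sum of binomial coefficients $\sum_{i \le k} \binom{2^n}{i}$; the function name is illustrative. It returns the exact logarithm of that count, the closed-form bound $nk - k\log(k/e)$, and the lower bound $nk + 1$ of Theorem 4.

```python
import math


def dnf_counting_bounds(n, k):
    """Compare log2 of the DNF count with the bounds used in Corollary 5.

    Returns (log2 of sum_{i<=k} C(2^n, i), n*k - k*log2(k/e), n*k + 1).
    """
    count = sum(math.comb(2 ** n, i) for i in range(k + 1))
    log_count = math.log2(count)
    closed_form = n * k - k * math.log2(k / math.e)
    return log_count, closed_form, n * k + 1
```

For $n = 4$ and $k = 3$ there are at most 697 collections, so the VC dimension of 3-term monotone DNF over 4 variables is below 10, while Theorem 4 gives at least 13 for 3-sparse polynomials.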
It is easy to see that in the cases $k = 1, 2$, which are not covered by Corollary 5,
the VC dimension of $k$-sparse polynomials is larger as well. First, as there
are no more than $2^n$ Boolean monotone monomials, the VC dimension of monotone
monomials is at most $n$. Second, the number of monotone DNF formulas
with at most two terms is not larger than $2^{2n} + 1$, and $\log(2^{2n} + 1)$ is less than
$2n + 1$.

6 Conclusion 

We have derived a new lower bound for the VC dimension of sparse multivariate 
polynomials. This bound is stronger and holds for a wider class of polynomials 
than the previous bound via Boolean formulas in monotone DNF. Moreover,
it follows that the VC dimension for $k$-sparse polynomials exceeds the VC dimension
for $k$-term monotone DNF. Therefore, the techniques that use DNF
formulas for deriving lower bounds on the VC dimension of sparse polynomials
seem to have reached their limits.

We have introduced a method that accomplishes dichotomies of sets by poly- 
nomials via Gaussian RBF networks. At first view, the Gaussian RBF network 
appears to be more powerful than a polynomial, provided both have the same 
number of terms: Each parameter of a Gaussian RBF network may assume any 
real number, whereas the polynomial must have exponents that are nonnegative 
and integers. Nevertheless, we have shown here that RBF networks can be used
to establish lower bounds on the computational capabilities of sparse multivariate
polynomials. While the previous lower bound method via monotone DNF
formulas gives rise to monomials with exponents not larger than 1, the approach
that uses RBF networks shows that large exponents can be employed, and how, to
shatter sets of a cardinality larger than known before. Moreover, the constructions
give rise to a completely new interpretation of the exponent vectors
when polynomials are used for classification tasks: They have been chosen as
centers of balls. This perspective might open a new approach for the design of
learning algorithms that use sparse multivariate polynomials as hypotheses.

The result of this paper narrows the gap between lower and upper bound 
for the VC dimension of sparse multivariate polynomials. As the bounds are not 
yet tight, it is to be hoped that the method presented here may lead to further 
insights that possibly yield additional improvements. 
Abstract. For hyper-rectangles in $\mathbb{R}^d$ Auer et al. [1] proved a PAC
bound of $O(\frac{1}{\varepsilon}(d + \log\frac{1}{\delta}))$, where $\varepsilon$ and $\delta$ are the accuracy and confidence
parameters. It is still an open question whether one can obtain
the same bound for intersection-closed concept classes of VC-dimension
$d$ in general. We present a step towards a solution of this
problem showing on one hand a new PAC bound of $O(\frac{1}{\varepsilon}(d \log d + \log\frac{1}{\delta}))$
for arbitrary intersection-closed concept classes complementing the well-known
bounds $O(\frac{1}{\varepsilon}(\log\frac{1}{\delta} + d \log\frac{1}{\varepsilon}))$ and $O(\frac{d}{\varepsilon}\log\frac{1}{\delta})$ of Blumer et al.
and Haussler et al. [4,6]. Our bound is established using the closure
algorithm, that generates as its hypothesis the smallest concept that
is consistent with the positive training examples. On the other hand,
we show that maximum intersection-closed concept classes meet the
bound of $O(\frac{1}{\varepsilon}(d + \log\frac{1}{\delta}))$ as well. Moreover, we indicate that our new
as well as the conjectured bound cannot hold for arbitrary consistent
learning algorithms, giving an example of such an algorithm that
needs $\Omega(\frac{1}{\varepsilon}(d \log\frac{1}{\varepsilon} + \log\frac{1}{\delta}))$ examples to learn some simple maximum
intersection-closed concept class.



1 Introduction 

In the PAC model a learning algorithm generalizes from given examples to a hypothesis
that approximates a target concept taken from a concept class known to
the learner. The learning algorithm $A$ then PAC learns a concept class if for $\varepsilon, \delta$
there is an $m = m(\varepsilon, \delta)$, such that with probability at least $1 - \delta$ the algorithm
outputs a hypothesis with error at most $\varepsilon$ when $m$ random examples are given to
$A$. Bounds on $m$ usually depend on the VC-dimension, a combinatorial parameter
of the concept class. For finite $d$ the well-known bound of Blumer et al. [4]
states that for any consistent learning algorithm $O(\frac{1}{\varepsilon}(\log\frac{1}{\delta} + d \log\frac{1}{\varepsilon}))$ examples
suffice for PAC learning concept classes of VC-dimension $d$. On the other hand,
for the 1-inclusion graph algorithm a bound of $O(\frac{d}{\varepsilon}\log\frac{1}{\delta})$ was established in
[6]. In this paper we give a complementing bound of $O(\frac{1}{\varepsilon}(d \log d + \log\frac{1}{\delta}))$ when
learning intersection-closed concept classes (see e.g. [1,2,7]) with the closure algorithm.
Intersection-closed concept classes include quite natural classes such as
hyper-rectangles in $\mathbb{R}^d$ or the class of all subsets of some finite $X$ with $\le d$ elements.
For these concrete intersection-closed concept classes an optimal bound
of $O(\frac{1}{\varepsilon}(d + \log\frac{1}{\delta}))$ can be shown (see [3] and Sect. 4 below, resp.). It is an
open problem whether this optimal bound holds for intersection-closed concept
classes in general. If so, it can be achieved only for special learning algorithms
since there are consistent learning algorithms that need $\Omega(\frac{1}{\varepsilon}(d \log\frac{1}{\varepsilon} + \log\frac{1}{\delta}))$
examples to learn some intersection-closed concept classes (see Sect. 4 below).

2 Preliminaries 

2.1 Intersection-Closed Concept Classes 

A concept class over a (countable) set $X$ is a subset $\mathcal{C} \subseteq 2^X$. For $Y \subseteq X$ we set
$\mathcal{C} \cap Y := \{C \cap Y \mid C \in \mathcal{C}\}$. The VC-dimension of a concept class $\mathcal{C} \subseteq 2^X$ is the
cardinality of a largest $Y \subseteq X$ for which $\mathcal{C} \cap Y = 2^Y$.

Definition 1. A concept class $\mathcal{C} \subseteq 2^X$ is intersection-closed if for all $C_1, C_2 \in
\mathcal{C}$: $C_1 \cap C_2 \in \mathcal{C}$.

For any set $Y \subseteq X$ and any concept class $\mathcal{C} \subseteq 2^X$ we define the closure of $Y$
(with respect to $\mathcal{C}$) as $\mathrm{clos}_{\mathcal{C}}(Y) := \bigcap_{Y \subseteq C \in \mathcal{C}} C$. If it is clear to which concept class
we refer we often drop the index and write $\mathrm{clos}(Y)$. The following proposition
provides an alternative definition of intersection-closed concept classes.

Proposition 2. A concept class $\mathcal{C} \subseteq 2^X$ is intersection-closed if and only if for
$Y \subseteq C \in \mathcal{C}$ one always has $\mathrm{clos}(Y) \in \mathcal{C}$.

Proof. First, it is clear by definition that $\mathrm{clos}(Y) \in \mathcal{C}$ for intersection-closed
$\mathcal{C}$. Now suppose that for $Y \subseteq C \in \mathcal{C}$ one always has $\mathrm{clos}(Y) \in \mathcal{C}$ and let
$C_1, C_2 \in \mathcal{C}$. Then because of $C_1 \cap C_2 \subseteq C_1, C_2$ we have by definition of the
closure, $\mathrm{clos}(C_1 \cap C_2) \subseteq C_1 \cap C_2$. On the other hand, $C_1 \cap C_2 \subseteq \mathrm{clos}(C_1 \cap C_2)$,
so that $C_1 \cap C_2 = \mathrm{clos}(C_1 \cap C_2) \in \mathcal{C}$. □
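For a finite domain, the closure operator and the intersection-closed property can be computed directly. The sketch below represents concepts as `frozenset` objects; the function names are illustrative, and returning the whole domain when no concept contains $Y$ (the empty-intersection convention) is an assumption of this sketch, since the text only uses the closure for sets contained in some concept.

```python
def closure(Y, concept_class, domain):
    """Intersection of all concepts in `concept_class` that contain Y.

    Y and the concepts are frozensets over `domain`. If no concept
    contains Y, the whole domain is returned (assumed convention).
    """
    result = frozenset(domain)
    for concept in concept_class:
        if Y <= concept:
            result = result & concept
    return result


def is_intersection_closed(concept_class):
    """True if the pairwise intersection of concepts is again a concept."""
    return all((a & b) in concept_class
               for a in concept_class for b in concept_class)
```

As an example, the discrete intervals over {0, 1, 2, 3} together with the empty set form an intersection-closed class, and the closure of {0, 2} in it is the interval {0, 1, 2}, which is itself a concept as Proposition 2 requires.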

Again, let $Y \subseteq X$. A spanning set of $Y$ (with respect to an intersection-closed
concept class) is any set $S \subseteq Y$ such that $\mathrm{clos}(S) = \mathrm{clos}(Y)$. A spanning
set $S$ of $Y$ is called minimal if there is no spanning set $S'$ of $Y$ with $|S'| < |S|$.
Finally, let $\mathrm{span}_{\mathcal{C}}(Y)$ denote the set of all minimal spanning sets of $Y$. Again
we will often drop the index if no ambiguity can arise. The following theorem
mentions a key property of intersection-closed concept classes (for a proof we
refer to [7]).

Theorem 3. All minimal spanning sets of some $Y \subseteq X$ in an intersection-closed
class $\mathcal{C} \subseteq 2^X$ have size at most $\mathrm{VC\text{-}dim}(\mathcal{C})$.

Furthermore, we shall need the following well-known theorem.

Theorem 4 (Sauer's Lemma [9]). Let $\mathcal{C} \subseteq 2^X$ be a concept class of VC-dimension
$d$. Then

$$|\mathcal{C}| \le \sum_{i=0}^{d} \binom{|X|}{i}.$$

2.2 Learning

Learning a concept $C \in \mathcal{C}$ means learning the characteristic function $\mathbb{1}_C$ on
$X$. Thus the learner outputs a hypothesis $h : X \to \{0, 1\}$. Given a probability
distribution $\mathcal{D}$ on $X$ the error of the hypothesis $h$ with respect to $C$ and $\mathcal{D}$ is
defined as $\mathrm{er}_{C,\mathcal{D}}(h) := \mathcal{D}(\{x \mid h(x) \ne \mathbb{1}_C(x)\})$.

Definition 5. A concept class $\mathcal{C} \subseteq 2^X$ is called PAC learnable if for all $\varepsilon, \delta \in
(0, 1)$, all probability distributions $\mathcal{D}$ on $X$ and all $C \in \mathcal{C}$ there is an $m = m(\varepsilon, \delta)$
such that when learning $C$ from $m$ randomly chosen examples according to $\mathcal{D}$ and
$C$ the output hypothesis $h$ has $\mathrm{er}_{C,\mathcal{D}}(h) > \varepsilon$ with probability $< \delta$ with respect to
the $m$ examples drawn independently according to $\mathcal{D}$.

3 A New PAC Bound 

The property mentioned in Theorem 3 can be used together with Sauer's Lemma
to modify the original proof of the bound of $O(\frac{1}{\varepsilon}(\log\frac{1}{\delta} + d \log\frac{1}{\varepsilon}))$ for arbitrary
concept classes by Blumer et al. [4] to obtain the following alternative bound.

Theorem 6. Let $\mathcal{C} \subseteq 2^X$ be a well-behaved¹ intersection-closed concept class of
VC-dimension $d \ge 10$. Then $\mathcal{C}$ is PAC learnable from

$$m \ge \max\left\{\frac{16}{\varepsilon}\, d \log d,\ \frac{6}{\varepsilon} \log\frac{28}{\delta}\right\}$$

examples.
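The sample size of Theorem 6 is easy to evaluate for concrete parameters. The following sketch reads the bound as the maximum of $\frac{16}{\varepsilon} d \log d$ and $\frac{6}{\varepsilon}\log\frac{28}{\delta}$; the function name is illustrative, and taking the logarithm to base 2 is an assumption of this sketch (the base can be swapped through the `log` parameter).

```python
import math


def closure_sample_size(d, eps, delta, log=math.log2):
    """Smallest integer m with m >= max(16/eps * d*log(d), 6/eps * log(28/delta)).

    The base of the logarithm is an assumption (base 2 by default).
    The theorem is stated only for VC-dimension d >= 10.
    """
    if d < 10:
        raise ValueError("Theorem 6 is stated for d >= 10")
    first = (16.0 / eps) * d * log(d)
    second = (6.0 / eps) * log(28.0 / delta)
    return math.ceil(max(first, second))
```

For $d = 10$, $\varepsilon = 0.1$ and $\delta = 0.05$ the first term dominates; the second term takes over only for extremely small $\delta$.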

The main step of the mentioned proof is the so-called "doubling trick" (for details
see [4], p.952ff): One chooses $2m$ (labelled) examples $(x_1, y_1), \ldots, (x_{2m}, y_{2m})$
and counts the number of permutations such that the hypothesis calculated from
the first $m$ examples misclassifies at least $p$ of the second $m$ examples. Then
choosing $p = \lceil \varepsilon m / 2 \rceil$ one obtains the bound. In the following we give an improved
bound for the number of permutations for intersection-closed concept
classes.

Unlike in the original proof we are going to use a special learning algorithm,
namely the closure algorithm. Given a set of labelled examples $(x_1, y_1), \ldots,
(x_m, y_m)$ with labels $y_i \in \{0, 1\}$ the hypothesis generated by the closure algorithm
is the smallest concept $C \in \mathcal{C}$ that is consistent with the positive examples,
that is, the examples with $y_i = 1$. It is easy to see that this concept is identical
to the closure of $\{x_i \mid y_i = 1, 1 \le i \le m\}$. Thus, negative examples do not
have any influence on the generated hypothesis. Moreover we have the following
proposition.

Proposition 7. The closure algorithm classifies all negative examples correctly. 

¹ The usual measurability conditions on certain sets turning up in the proof of Lemma
9 below have to be satisfied (for a detailed discussion see [4], p.952ff). However, we
remark that concept classes over finite $X$ are always well-behaved.

Proof. The algorithm returns the smallest concept that is consistent with the
positive examples. Consequently, if it classified any negative example incorrectly
there would not be any concept in $\mathcal{C}$ that is consistent with the given examples.

□
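For axis-parallel hyper-rectangles, one of the intersection-closed classes named in the introduction, the closure algorithm reduces to taking the bounding box of the positive examples. The sketch below illustrates this special case only; the function names and the use of `None` for the empty hypothesis are assumptions.

```python
def closure_algorithm(examples):
    """Smallest axis-parallel box containing all positive examples.

    `examples` is a list of (point, label) pairs with labels in {0, 1}.
    Returns (lower, upper) corner tuples, or None if there is no
    positive example (the empty hypothesis).
    """
    positives = [x for x, y in examples if y == 1]
    if not positives:
        return None
    dims = range(len(positives[0]))
    lower = tuple(min(p[j] for p in positives) for j in dims)
    upper = tuple(max(p[j] for p in positives) for j in dims)
    return lower, upper


def predict(box, x):
    """Label assigned to point x by the box hypothesis."""
    if box is None:
        return 0
    lower, upper = box
    return int(all(l <= xj <= u for l, xj, u in zip(lower, x, upper)))
```

With a target rectangle $[0,2] \times [0,2]$ and positive examples (0.5, 0.5) and (1.5, 1.0), every negative example is classified correctly, as Proposition 7 states, while the unseen positive point (1.9, 1.9) is misclassified. Errors therefore occur on positive points only.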

Hence, according to Proposition 7, any incorrectly classified example among
$(x_{m+1}, y_{m+1}), \ldots, (x_{2m}, y_{2m})$ must be positive. Thus when counting the number
of the aforementioned permutations we can confine ourselves to positive examples.
Let $\ell$ be the number of positive examples among $(x_1, y_1), \ldots, (x_{2m}, y_{2m})$. We
define recursively sets $X_i$ and $S_i$ for $i = 1, \ldots, \ell$, where $X_1 := \{x_i \mid y_i = 1, 1 \le i \le
2m\}$ is the set of positive examples, $S_i$ is an arbitrary element of $\mathrm{span}(X_i)$ and
for $i > 1$ we set $X_i := X_{i-1} \setminus S_{i-1}$. Now for each $X_i$ that contains misclassified
examples there must be at least one misclassified example in the corresponding
spanning set $S_i$ as well. Thus removing $S_i$ from $X_i$ at least one misclassified
example is removed, which leads to the following proposition.

Proposition 8. If there are $k$ incorrectly classified examples among the $x_1, \ldots,
x_{2m}$ they are in $\bigcup_{i=1}^{k} S_i$.

Proof. By Proposition 7, misclassified examples must be in $X_1$. Now suppose
there is a wrongly classified example that is not in $\bigcup_{i=1}^{k} S_i$. Since the $S_i$ are
disjoint it follows that there is an $S_j$ with $j \le k$ that does not contain any misclassified
example. Thus, all examples in $S_j$ and consequently all examples in $X_j$ are
classified correctly. But this is only possible if all the $k$ misclassified examples
have been removed before, so that they have to be contained in $\bigcup_{i=1}^{j-1} S_i \subseteq
\bigcup_{i=1}^{k} S_i$, which contradicts our assumption. □

Lemma 9. Let $\mathcal{C} \subseteq 2^X$ be a well-behaved intersection-closed concept class of
VC-dimension $d$, $\mathcal{D}$ be a probability distribution on $X$ and the target concept $C$
be a Borel set $C \subseteq X$. Then for all $\varepsilon > 0$ and for all $m \ge 2/\varepsilon$, given $m$ independent
random examples labelled by $C$ and drawn according to $\mathcal{D}$, the probability that
the hypothesis $h$ generated by the closure algorithm has error $\mathrm{er}_{C,\mathcal{D}}(h) > \varepsilon$ is at
most

$$2 \sum_{k=p}^{m} 2^{-k} \sum_{i=0}^{d} \binom{kd}{i},$$

where $p = \lceil \varepsilon m / 2 \rceil$.

Proof. As mentioned before, the proof follows the main lines of [4], pp.952ff.
However, our equivalent to Lemma A2.2 looks a bit different. Concerning the
number of witnesses, i.e. the sets of wrongly classified examples, in the proof
of Lemma A2.2 we need not consider $\Pi_{\mathcal{C}}(2m)$, the number of all subsets of
$\{x_1, \ldots, x_{2m}\}$ that are induced by intersections with concepts in $\mathcal{C}$. Instead,
according to Proposition 8, it is sufficient to consider the corresponding subsets
of $\bigcup_{i=1}^{k} S_i$ for $k = p, \ldots, m$. By Theorem 3, $|\bigcup_{i=1}^{k} S_i| \le kd$ so that by Sauer's
Lemma the number of these subsets for fixed $k$ is at most $\sum_{i=0}^{d} \binom{kd}{i}$. Summing up
over all $k \in \{p, \ldots, m\}$ the result follows analogously to the proofs of Lemma
A2.2 and Theorem A2.1 in [4]. □
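Lemma 10. If $d \ge 10$ and

$$m \ge \max\left\{\frac{16}{\varepsilon}\, d \log d,\ \frac{6}{\varepsilon} \log\frac{28}{\delta}\right\} \quad \text{then} \quad 2 \sum_{k=p}^{m} 2^{-k} \sum_{i=0}^{d} \binom{kd}{i} < \delta,$$

where $p = \lceil \varepsilon m / 2 \rceil$.

Proof. First, we are going to use Proposition A2.1 (iii) of [4], which tells us that
for $k, d \ge 1$ one has

$$\sum_{i=0}^{d} \binom{kd}{i} \le (ek)^d. \qquad (1)$$

It is easy to check that for $d \ge 10$ and $k \ge 8 d \log d$ it holds that

$$(ek)^d \le 2^{k/2}. \qquad (2)$$

Hence for $k \ge 8 d \log d$ we have from (1) and (2)

$$2 \sum_{k=p}^{m} 2^{-k} \sum_{i=0}^{d} \binom{kd}{i} \le 2 \sum_{k=p}^{m} 2^{-k} (ek)^d \le 2 \sum_{k=p}^{m} 2^{-k/2} < 2 \cdot 2^{-p/2} \cdot \frac{2}{2 - \sqrt{2}}.$$

Setting $K := \frac{4}{2 - \sqrt{2}}$ and substituting $p = \lceil \varepsilon m / 2 \rceil$ it is easy to see that for
$m \ge \frac{4}{\varepsilon} \log\frac{K}{\delta}$, which is implied by $m \ge \frac{6}{\varepsilon} \log\frac{28}{\delta}$, one has $K \cdot 2^{-p/2} \le \delta$, which finishes the proof. □

Proof of Theorem 6. The theorem follows immediately from Lemmata 9 and 10.

□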



4 Maximum Intersection-Closed Classes 



A concept class $\mathcal{C}$ over finite $X$ is called maximum (cf. [5]), if it meets the bound
of Sauer's Lemma (Theorem 4 above), that is, if $|\mathcal{C}| = \sum_{i=0}^{d} \binom{|X|}{i}$. An example of a
maximum (and intersection-closed) concept class of VC-dimension $d$ is $\mathcal{C}_{X,d} :=
\{C \subseteq X : |C| \le d\}$, the class of all $\le d$-subsets of $X$.

This time adapting the proof of the bound of $O(\frac{1}{\varepsilon}(d + \log\frac{1}{\delta}))$ for hyper-rectangles
in [3] we show that the closure algorithm learns maximum intersection-closed
concept classes from $O(\frac{1}{\varepsilon}(d + \log\frac{1}{\delta}))$ examples as well.

Theorem 11. Let $\mathcal{C}$ be a maximum intersection-closed concept class of VC-dimension
$d$ over finite $X$. Then $\mathcal{C}$ is PAC learnable from



m > 



5(,i + logl) 



examples. 



For the proof of Theorem 11 we will use the following key property of maximum
classes (for a proof we refer to [5]).

Theorem 12 (Welzl 1987). Let $\mathcal{C} \subseteq 2^X$ be a maximum concept class of VC-dimension
$d$ over finite $X$. Then for any $x \in X$ the concept class $\mathcal{C}_x := \{C \in
\mathcal{C} \mid x \in C\}$ is maximum of VC-dimension $d - 1$.

Corollary 13 (Welzl 1987). Let $\mathcal{C} \subseteq 2^X$ be a maximum concept class of VC-dimension
$d$ over finite $X$. Then for any $d$-subset $Y \subseteq X$ the concept class
$\mathcal{C}_Y := \{C \in \mathcal{C} \mid Y \subseteq C\}$ has VC-dimension 0 and hence consists of a single
element.

Proof of Theorem 11. As mentioned before we follow the main lines of the proof
of Theorem 7 in [3], pp.381ff. We only have to argue that Lemma 10 of [3] holds
in our case as well. This time we have to count the number of possibilities to
choose $m$ from $m + p$ examples $(x_1, y_1), \ldots, (x_{m+p}, y_{m+p})$ such that the hypothesis
calculated from these $m$ examples misclassifies the $p$ remaining examples.
Obviously, we may consider the concept class $\mathcal{C}' = \mathcal{C} \cap \{x_1, \ldots, x_{m+p}\}$ instead
of $\mathcal{C}$ itself. Thus, we will show that the number of concepts in $\mathcal{C}'$ that misclassify
exactly $p$ examples among $(x_1, y_1), \ldots, (x_{m+p}, y_{m+p})$ is $\le \binom{d+p}{p}$. Then choosing
$p = \lceil \log\frac{1}{\delta} \rceil$, the theorem follows analogously to [3].

Again using the closure algorithm, only the positive examples $X_1 = \{x_i \mid y_i =
1, 1 \le i \le m + p\}$ are relevant for hypothesis calculation and evaluation as
well. We assume that none of the positive examples occurs more than once
among $(x_1, y_1), \ldots, (x_{m+p}, y_{m+p})$. Otherwise the number of partitions will be
even smaller.

Now we want to encode the concepts in $\mathcal{C}'$ according to their classification of
the examples in $X_1$. To this end we impose an arbitrary but fixed order on the
elements of $X_1$. Each concept $C \in \mathcal{C}'$ is then encoded as a word in $\{0, 1\}^{|X_1|}$ as
follows: a 1 on the $j$-th position means that $C$ classifies $x_{i_j}$ correctly, while a 0
indicates that $x_{i_j}$ is misclassified by $C$. Being interested only in concepts that
misclassify exactly $p$ examples of $X_1$ we need only consider the first $d + p$ letters
of the code words. First, it is clear that there cannot occur more than $p$ 0-entries
in the code word corresponding to such a concept. On the other hand, if there
are $\ge d$ 1-entries in a code word $w$, according to Corollary 13 there is only one
concept in $\mathcal{C}'$ that corresponds to $w$. Thus, the number of concepts is bounded
above by the number of code words consisting of $p$ 0-entries and $d$ 1-entries. The
latter is equal to $\binom{d+p}{p}$, which finishes our proof. □

The following example shows that for the new bounds in this paper, the
choice of the learning algorithm is essential. Consider $\mathcal{C}_{X,d}$, the class of all
subsets of $X$ with at most $d$ elements, and an algorithm that chooses as its hypothesis
not the smallest concept consistent with the given examples (as the closure algorithm
does) but an arbitrarily chosen largest consistent concept. We claim that this algorithm
needs $\Omega\left(\frac{1}{\epsilon}\left(d \log \frac{1}{\epsilon} + \log \frac{1}{\delta}\right)\right)$ examples to learn $\mathcal{C}_{X,d}$. First we show a lower bound
of $\Omega\left(\frac{d}{\epsilon} \log \frac{1}{\epsilon}\right)$. Let $X$ consist of $n := \lceil \frac{d}{\epsilon} \rceil$ elements and $\mathcal{D}$ be the uniform
distribution on $X$. When learning the target concept $\emptyset \in \mathcal{C}_{X,d}$ the error of the
algorithm's hypothesis is $\leq \epsilon$ only if at least $n - (d - 1)$ distinct examples
appear among the training examples. The probability that a certain example
is not among the $m$ training examples is $\left(1 - \frac{1}{n}\right)^m$. Let $Z$ be a random variable
denoting the number of examples in $X$ that are not in the training set.
Thus, $\mu := E(Z) = n \left(1 - \frac{1}{n}\right)^m$. Note that $Z$ is binomially distributed, so that
$P\{Z \geq \lfloor \mu \rfloor\} > \frac{1}{2}$ (cf. Appendix B of [8]). If $m = \frac{d}{\epsilon} \log \frac{1}{\epsilon}$ then for small $\frac{\epsilon}{d}$
(so that $\log(1 - \frac{\epsilon}{d}) \approx -\frac{\epsilon}{d}$) we have $\mu \approx d$. Since $P\{Z \geq \lfloor \mu \rfloor\} > \frac{1}{2}$ it follows
that at least $\Omega\left(\frac{d}{\epsilon} \log \frac{1}{\epsilon}\right)$ examples are needed to learn $\mathcal{C}_{X,d}$. Note that for another
suitable distribution on $X$ (cf. [4] for details) one obtains the well-known
lower bound of $\Omega\left(\frac{1}{\epsilon} \log \frac{1}{\delta}\right)$, so that altogether this establishes a lower bound of
$\Omega\left(\frac{1}{\epsilon}\left(d \log \frac{1}{\epsilon} + \log \frac{1}{\delta}\right)\right)$.
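
The estimate $\mu \approx d$ in the argument above can be checked numerically. The following is a minimal sketch, assuming the reading $n = d/\epsilon$, $m = (d/\epsilon)\ln(1/\epsilon)$ and $\mu = n(1 - 1/n)^m$; the function name and the concrete values of $d$ and $1/\epsilon$ are illustrative choices, not taken from the paper.

```python
import math

def expected_unseen(n: int, m: int) -> float:
    """Expected number of the n domain points that never occur among
    m i.i.d. uniform draws: mu = n * (1 - 1/n)^m."""
    return n * (1.0 - 1.0 / n) ** m

# Illustrative values (assumptions): d = 5 and 1/eps = 100.
d, inv_eps = 5, 100
n = d * inv_eps                          # domain size n = d / eps
m = math.ceil(n * math.log(inv_eps))     # sample size m = (d / eps) * ln(1 / eps)
mu = expected_unseen(n, m)
print(f"n={n}, m={m}, mu={mu:.3f}, d={d}")
```

With a sample of this size roughly $d$ domain points remain unseen on average, which is the regime in which a largest consistent hypothesis can still have error above $\epsilon$.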

5 Final Remarks 

The extension of our result for maximum intersection-closed concept classes 
to intersection-closed concept classes in general seems to be far from trivial. 
For hyper-rectangles in $\mathbb{R}^d$ the given topological structure allows one to obtain
the conjectured bound, while for maximum intersection-closed concept classes 
the result of Welzl provides a similar structure that can be used. However, for 
arbitrary intersection-closed concept classes it seems to be hard to impose some 
kind of structure that is sufficient to obtain the desired bound. Our Proposition 
8 is obviously not strong enough. Thus, we think that some combinatorial key 
result will be needed to make further progress.

Abstract. We consider a framework in which the clustering algorithm
gets as input a sample generated i.i.d. by some unknown arbitrary distribution,
and has to output a clustering of the full domain set, which is
evaluated with respect to the underlying distribution. We provide general
conditions on clustering problems that imply the existence of sampling-based
clusterings that approximate the optimal clustering. We show that
the K-median clustering, as well as the Vector Quantization problem,
satisfy these conditions. In particular, our results apply to the sampling-based
approximate clustering scenario. As a corollary, we get a sampling-based
algorithm for the K-median clustering problem that finds an almost
optimal set of centers in time depending only on the confidence
and accuracy parameters of the approximation, but independent of the
input size. Furthermore, in the Euclidean input case, the running time
of our algorithm is independent of the Euclidean dimension.



1 Introduction 

We consider the following fundamental problem: 

Some unknown probability distribution, over some large (possibly in- 
finite) domain set, generates an i.i.d. sample. Upon observing such a 
sample, a learner wishes to generate some simple, yet meaningful, de- 
scription of the underlying distribution. 

The above scenario can be viewed as a high level definition of unsupervised
learning. Many well established statistical tasks, such as Linear Regression,
Principal Component Analysis and Principal Curves, can be viewed in this light. In
this work, we restrict our attention to clustering tasks. That is, the description
that the learner outputs is in the form of a finite collection of subsets (or a
partition) of the domain set. As a measure of the quality of the output of the
clustering algorithm, we consider objective functions defined over the underlying
domain set and distribution.

This formalization is relevant to many realistic scenarios, in which it is nat- 
ural to assume that the information we collect is only a sample of a larger body 
which is our object of interest. One such example is the problem of Quantizer De- 
sign [2] in coding theory, where one has to pick a small number of vectors, ‘code 
words’, to best represent the transmission of some unknown random source. 

Results in this general framework can be applied to the worst-case model of 
clustering as well, and in some cases, yield significant improvements to the best 
previously known complexity upper bounds. We elaborate on this application in 
the subsection on worst-case complexity view below. 

The paradigm that we analyze is the simplest sampling-based meta- 
algorithm. Namely, 

1. Draw an i.i.d random sample of the underlying probability 
distribution. 

2. Find a good clustering of the sample. 

3. Extend the clustering of the sample to a clustering of the 
full domain set. 

A key issue in translating the above paradigm into a concrete algorithm is
the implementation of step 3: how should a clustering of a subset be extended
to a clustering of a full set? For clusterings defined by a choice of a fixed number
of centers, like the K-median problem and vector quantization, there is a
straightforward answer: use the cluster centers that the algorithm found
for the sample as the cluster centers for the full set. While there are ways to
extend clusterings of subsets for other types of clustering, in this paper we focus
on the K-median and vector quantization problems.
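
The three steps of the meta-algorithm can be sketched for K-median as follows. This is only an illustration under stated assumptions: the names `kmedian_cost` and `sample_based_kmedian` are hypothetical, and step 2 uses an exhaustive search over k-subsets of the sample purely for brevity, where any K-median solver could be substituted.

```python
import itertools
import random

def kmedian_cost(points, centers, dist):
    """Average distance from each point to its closest center."""
    return sum(min(dist(p, c) for c in centers) for p in points) / len(points)

def sample_based_kmedian(draw, k, m, dist, rng):
    """Hypothetical helper: returns (centers, assign) following steps 1-3."""
    # Step 1: draw an i.i.d. sample of size m from the unknown distribution.
    sample = [draw(rng) for _ in range(m)]
    # Step 2: cluster the sample. Exhaustive search over k-subsets of the
    # sample is an assumption made for brevity, not the paper's procedure.
    centers = min(itertools.combinations(sample, k),
                  key=lambda c: kmedian_cost(sample, c, dist))
    # Step 3: extend to the full domain by reusing the sample's centers;
    # any domain point is assigned to the index of its closest center.
    def assign(y):
        return min(range(k), key=lambda j: dist(y, centers[j]))
    return list(centers), assign

if __name__ == "__main__":
    rng = random.Random(0)
    draw = lambda r: r.choice([0.0, 0.1, 0.9, 1.0])   # toy distribution on the line
    centers, assign = sample_based_kmedian(
        draw, k=2, m=12, dist=lambda a, b: abs(a - b), rng=rng)
    print(centers, assign(0.05), assign(0.95))
```

The returned `assign` function is the implicit description of the clustering of the full domain: it never touches domain points other than the one being queried.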

The focus of this paper is an analysis of the approximation quality of sampling 
based clustering. We set the ground for a systematic discussion of this issue in 
the general context of statistical clustering, and demonstrate the usefulness of 
our approach by considering the concrete case of K-median clustering.

We prove that certain properties of clustering objective functions suffice to
guarantee that an implicit description of an almost optimal clustering can be
found in time depending on the confidence and accuracy parameters of the
approximation, but independent of the input size. We show that the K-median
clustering objective function, as well as the vector quantization cost, enjoy these
properties. We are therefore able to demonstrate the first known constant-time
approximation algorithm for the K-median problem.

The paradigm outlined above has been considered in previous work in the
context of sampling based approximate clustering. Buhmann [3] describes a similar
meta-algorithm under the title "Empirical Risk Approximation". Buhmann
suggests adding an intermediate step of averaging over a set of empirically good
clusterings, before extending the result to the full data set. Such a step helps
reduce the variance of the output clustering. However, Buhmann's analysis is
under the assumption that the data-generating distribution is known to the
learner. We address the distribution free (or, worst case) scenario, where the
only information available to the learner is the input sample and the underlying
metric space.

Our main technical tool is a uniform convergence result that upper bounds,
as a function of the sample sizes, the discrepancy between the empirical cost of
certain families of clusterings and their true cost (as defined by the underlying
probability distribution). Convergence results for the empirical estimates of the
k-median cost of clusterings were previously obtained for the limiting behavior,
as sample sizes go to infinity (see, e.g., Pollard [6]). Finite-sample convergence
bounds were obtained for the k-median problem by Mishra et al. [5], and for the
vector quantization problem by Bartlett et al. [2], who also provide a discussion
of vector quantization in the context of coding theory. Smola et al. [7]
provide a framework for more general quantization problems, as well as convergence
results for regularized versions of these problems. However, the families
of cluster centers that our method covers are much richer than the families of
centers considered in these papers.



1.1 Worst-Case Complexity View

Recently there has been growing interest in sampling based algorithms for
approximating NP-hard clustering problems (see, e.g., Mishra et al. [5], de la Vega et al.
[8] and Meyerson et al. [4]). In these problems, the input to an algorithm is a
finite set $X$ in a metric space, and the task is to come up with a clustering of $X$
that minimizes some objective function. The sampling based algorithm performs
this task by considering a relatively small $S \subseteq X$ that is sampled uniformly at
random from $X$, and applying a (deterministic) clustering algorithm to $S$. The
motivating idea behind such an algorithm is the hope that relatively small sample
sizes may suffice to induce good clusterings, and thus result in computational
efficiency. In these works one usually assumes that a point can be sampled
uniformly at random over $X$ in constant time. Consequently, using this approach,
the running time of such algorithms is reduced to a function of the size of the
sample (rather than of the full input set $X$) and the computational complexity
analysis boils down to the statistical analysis of sufficient sample sizes.

The analysis of the model proposed here is relevant to these settings too. By
taking the underlying distribution to be the uniform distribution over the input
set $X$, results that hold for our general scenario readily apply to
sampling based approximate clustering as well.

The worst case complexity of sampling based K-median clustering is addressed
in Mishra et al. [5], where such an algorithm is shown to achieve a sublinear
upper bound on the computational complexity for the approximate K-median
problem. They prove their result by showing that with high probability,
a sample of size O suffices to achieve a clustering with average cost 

(over all the input points) of at most $2\,Opt + \epsilon$ (where $Opt$ is the average cost
of an optimal $k$-clustering). By proving a stronger upper bound on sufficient
sample sizes, we are able to improve these results. We prove upper bounds on
the sufficient sample sizes (and consequently on the computational complexity)
that are independent of the input size $n$.



2 The Formal Setup 

We start by providing a definition of our notion of a statistical clustering problem.
Then, in the 'basic tool box' subsection, we define the central tool for this
work, the notion of a clustering description scheme, as well as the properties of
these notions that are required for the performance analysis of our algorithm.
Since the generic example that this paper addresses is that of K-median clustering,
we shall follow each definition with its concrete manifestation for the
K-median problem.

Our definition of clustering problems is in the spirit of combinatorial opti- 
mization. That is, we consider problems in which the quality of a solution (i.e. 
clustering) is defined in terms of a precise objective function. One should note 
that often, in practical applications of clustering, there is no such well defined 
objective function, and many useful clustering algorithms cannot be cast in such 
terms. 

Definition 1 (Statistical clustering problems).

– A clustering problem is defined by a triple $(X, \mathcal{T}, R)$, where $X$ is some domain
set (possibly infinite), $\mathcal{T}$ is a set of legal clusterings (or partitions) of
$X$, and $R : \mathcal{D} \times \mathcal{T} \to [0, 1]$ is the objective function (or risk) the clustering
algorithm aims to minimize, where $\mathcal{D}$ is a set of probability distributions over
$X$.¹

– For a finite $S \subseteq X$, the empirical risk of a clustering $T$ on a sample $S$,
$R(S, T)$, is the risk of the clustering $T$ with respect to the uniform distribution
over $S$.

– For the K-median problem, the domain set $X$ is endowed with a metric $d$
and $\mathcal{T}$ is the set of all $k$-cell Voronoi diagrams over $X$ that have points of
$X$ as centers. Clearly each $T \in \mathcal{T}$ is determined by a set $\{x_1^T, \ldots, x_k^T\} \subseteq X$,
consisting of the cells' centers. Finally, for a probability distribution $P$ over
$X$, and $T \in \mathcal{T}$, $R(P, T) = E_{y \in P}\left(\min_{j \in \{1, \ldots, k\}} d(y, x_j^T)\right)$. That is, the risk of a
partition defined by a set of $k$ centers is the expected distance of a $P$-random
point from its closest center.

Note that we have restricted the range of the risk function, $R$, to the unit
interval. This corresponds to assuming that, for the K-median and vector quantization
problems, the data points are all in the unit ball. This restriction allows
simpler formulas for the convergence bounds that we derive. Alternatively, one
could assume that the metric spaces are bounded by some constant and adjust
the bounds accordingly. On the other extreme, if one allows unbounded metrics,
then it is easy to construct examples for which, for any given sample size, the
empirical estimates are arbitrarily off the true cost of a clustering.

¹ In this paper, we shall always take $\mathcal{D}$ to be the class of all probability distributions
over the domain set, therefore we do not specify it explicitly in our notation. There
are cases in which one may wish to consider only a restricted set of distributions (e.g.,
distributions that are uniform over some finite subset of $X$) and such a restriction
may allow for sharper sample size bounds.

Having defined the setting for the problems we wish to investigate, we move 
on to introduce the corresponding notion of desirable solution. The definition of 
a clustering problem being ’approximable from samples’ resembles the definition 
of learnability for classification tasks. 

Definition 2 (Approximable from samples). A clustering problem
$(X, \mathcal{T}, R)$ is $\alpha$-approximable from samples, for some $\alpha \geq 1$, if there exist
an algorithm $A$ mapping finite subsets of $X$ to clusterings in $\mathcal{T}$, and a function
$f : (0, 1)^2 \to \mathbb{N}$, such that for every probability distribution $P$ over $X$ and every
$\epsilon, \delta \in (0, 1)$, if a sample $S$ of size $\geq f(\epsilon, \delta)$ is generated i.i.d. by $P$ then with
probability exceeding $1 - \delta$,

$R(P, A(S)) \leq \min_{T \in \mathcal{T}} \alpha R(P, T) + \epsilon.$

Note that formally, the above definition is trivially met for any fixed finite
size domain $X$. We have in mind the setting where $X$ is some infinite universal
domain, and one can embed in it finite domains of interest by choosing
the underlying distribution $P$ so that it has that set of interest as its support.
Alternatively, one could consider a definition in which the clustering problem is
defined by a scheme $\{(X_n, \mathcal{T}_n, R_n)\}_{n \in \mathbb{N}}$ and require that the sample size function
$f(\epsilon, \delta)$ is independent of $n$.



2.1 Our Basic Tool Box 

Next, we define our notion of an implicit representation of a clustering. We call it
a clustering description scheme. Such a scheme can be thought of as a compact
representation of clusterings in terms of sets of $\ell$ elements of $X$, and maybe some
additional parameters.

Definition 3 (Clustering description scheme).

Let $(X, \mathcal{T}, R)$ be a clustering problem. An $(\ell, I)$ clustering description scheme
for $(X, \mathcal{T}, R)$ is a function, $G : X^{\ell} \times I \to \mathcal{T}$, where $\ell$ is the number of points a
description depends on, and $I$ is a set of possible values for an extra parameter.

We shall consider three properties of description schemes. The first two can, 
in most cases, be readily checked from the definition of a description scheme. 
The third property has a statistical nature, which makes it harder to check. 
We shall first introduce the first two properties, completeness and localization, 
and discuss some of their consequences. The third property, coverage, will be 
discussed in Section 3.

Completeness: A description scheme, $G$, is Complete for a clustering problem
$(X, \mathcal{T}, R)$, if for every $T \in \mathcal{T}$ there exist $x_1, \ldots, x_\ell \in X$ and $i \in I$ such that
$G(x_1, \ldots, x_\ell, i) = T$.

Localization: A description scheme, $G$, is Local for a clustering problem
$(X, \mathcal{T}, R)$, if there exists a function $f : X^{\ell+1} \times I \to \mathbb{R}$ such that for any
probability distribution $P$, for all $x_1, \ldots, x_\ell \in X$ and $i \in I$,

$R(P, G(x_1, \ldots, x_\ell, i)) = E_{y \in P} f(y, x_1, \ldots, x_\ell, i)$



Examples:

The K-median problem endowed with the natural description scheme: in this
case, $\ell = k$ (the number of clusters), there is no extra parameter $i$, and
$G(x_1, \ldots, x_k)$ is the clustering assigning any point $y \in X$ to its closest neighbor
among $\{x_1, \ldots, x_k\}$. So, given a clustering $T$, if $\{x_1^T, \ldots, x_k^T\}$ are the centers
of $T$'s clusters, then $T = G(x_1^T, \ldots, x_k^T)$. Clearly, this is a complete and local
description scheme (with $f(y, x_1, \ldots, x_k) = \min_{i \in \{1, \ldots, k\}} d(y, x_i)$ and $F$ being
the identity function).

Vector Quantization: this problem arises in the context of source coding. The
problem is very similar to the K-median problem. The domain $X$ is the Euclidean
space $\mathbb{R}^d$, for some $d$, and one is given a fixed parameter $\ell$. On an input
set of $d$-dimensional vectors, one wishes to pick 'code points' $(x_1, \ldots, x_\ell) \in (\mathbb{R}^d)^{\ell}$
and map each input point to one of these code points. The only difference
between this and the K-median problem is the objective function that one
aims to minimize. Here it is $R(P, T_{x_1, \ldots, x_\ell}) = E_{y \in P}\left[\min_{i \in \{1, \ldots, \ell\}} d(y, x_i)^2\right]$.
The natural description scheme in this case is the same one as in the K-median
problem: describe a quantizer $T$ by the set of code points (or centers)
it uses. It is clear that, in this case as well, the description scheme is both
complete and local.

Note that in both the K-median clustering and the vector quantization
task, once such an implicit representation of the clustering is available, the
cluster to which any given domain point is assigned can be found from the
description in constant time (a point $y$ is assigned to the cluster whose index is
$\arg\min_{i \in \{1, \ldots, k\}} d(y, x_i)$).

The next claim addresses the cost function. Let us fix a sample size $m$. Given
a probability distribution $P$ over our domain space, let $P^m$ be the distribution
over i.i.d. $m$-samples induced by $P$. For a random variable $f(S)$, let $E_{S \in P^m}(f)$
denote the expectation of $f$ over this distribution.

Claim 1. Let $(X, \mathcal{T}, R)$ be a clustering problem. For $T \in \mathcal{T}$, if there exists a
function $h_T : X \to \mathbb{R}^+$ such that for any probability distribution $P$,
$R(P, T) = E_{x \in P}(h_T(x))$, then for every such $P$ and every integer $m$,

$E_{S \in P^m}(R(S, T)) = R(P, T)$

Corollary 2. If a clustering problem $(X, \mathcal{T}, R)$ has a local and complete description
scheme then, for every probability distribution $P$ over $X$, every $m \geq 1$
and every $T \in \mathcal{T}$,

$E_{S \in P^m}(R(S, T)) = R(P, T)$

Lemma 1. If a clustering problem $(X, \mathcal{T}, R)$ has a local and complete description
scheme then, for every probability distribution $P$ over $X$, every $m \geq 1$ and
every $T \in \mathcal{T}$,

$P^m\{|R(P, T) - R(S, T)| \geq \epsilon\} \leq 2e^{-2\epsilon^2 m}$

The proof of this Lemma is a straightforward application of Hoeffding's inequality
to the above corollary (recall that we consider the case where the risk $R$ is in
the range $[0, 1]$).
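
The unbiasedness stated in Claim 1 and Corollary 2 can be verified exactly on a small finite domain by enumerating every i.i.d. sample. The sketch below is only an illustration: the toy distribution `P`, the cost function `h` (distance to the closer of two assumed centers at 0 and 3, scaled into $[0, 1]$) and both function names are hypothetical. Exact rational arithmetic is used so that the two sides can be compared for equality.

```python
from fractions import Fraction
from itertools import product

def risk(weights, h):
    """R(P, T) = E_{x in P} h_T(x) for a finite distribution {point: probability}."""
    return sum(p * h(x) for x, p in weights.items())

def expected_empirical_risk(weights, h, m):
    """E_{S in P^m} R(S, T), computed by enumerating all ordered m-samples.
    R(S, T) is the average of h_T over the sample (with repetitions)."""
    total = Fraction(0)
    for sample in product(weights, repeat=m):
        prob = Fraction(1)
        for x in sample:
            prob *= weights[x]
        total += prob * sum(h(x) for x in sample) / m
    return total

# Toy instance (assumption): four points on a line, centers at 0 and 3.
P = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}
h = lambda x: Fraction(min(abs(x - 0), abs(x - 3)), 3)

print(risk(P, h), expected_empirical_risk(P, h, 3))
```

The equality holds for every sample size $m$ by linearity of expectation; the enumeration merely makes that concrete.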

Corollary 3. If a clustering problem $(X, \mathcal{T}, R)$ has a local and complete description
scheme then, for every probability distribution $P$ over $X$, and every
clustering $T \in \mathcal{T}$, if a sample $S \subseteq X$ of size $m \geq \frac{\ln(2/\delta)}{2\epsilon^2}$ is picked i.i.d. via $P$
then, with probability $\geq 1 - \delta$ (over the choice of $S$),

$|R(S, T) - R(P, T)| \leq \epsilon$

In fact, the proofs of the sample-based approximation results in this paper require
only the one-sided inequality, $R(S, T) \leq R(P, T) + \epsilon$.

So far, we have not really needed description schemes. In the next theorem,
which establishes the convergence of sample clustering costs to the true probability
costs, we rely heavily on the finite nature of description schemes. Indeed,
clustering description schemes play a role similar to that played by compression
schemes in classification learning.

Theorem 4. Let $G$ be a local description scheme for a clustering problem
$(X, \mathcal{T}, R)$. Then for every probability distribution $P$ over $X$, if a sample $S \subseteq X$
of size $m \gg \ell$ is picked i.i.d. by $P$ then, with probability $\geq 1 - \delta$ (over the choice
of $S$), for every $x_1, \ldots, x_\ell \in S$ and every $i \in I$,

$|R(S, G(x_1, \ldots, x_\ell, i)) - R(P, G(x_1, \ldots, x_\ell, i))| \leq \sqrt{\frac{\ln(|I|) + \ell \ln m + \ln(1/\delta)}{2(m - \ell)}}$   (1)



Proof. Corollary 3 implies that for every clustering of the form $G(x_1, \ldots, x_\ell, i)$,
if a large enough sample $S$ is picked i.i.d. by $P$, then with high probability, the
empirical risk of this clustering over $S$ is close to its true risk. It remains to
show that, with high probability, for $S$ sampled as above, this conclusion holds
simultaneously for all choices of $x_1, \ldots, x_\ell \in S$ and all $i \in I$.

To prove this claim we employ the following uniform convergence result:

Lemma 2. Given a family of clusterings $\{G(x_1, \ldots, x_\ell, i)\}_{x_1, \ldots, x_\ell \in X,\, i \in I}$, let $\epsilon(m, \delta)$
be a function such that, for every choice of $x_1, \ldots, x_\ell, i$ and every choice of $m$ and
$\delta > 0$, if a sample $S$ is picked by choosing i.i.d. uniformly over $X$, $m$ times, then
with probability $\geq 1 - \delta$,

$|R(S, G(x_1, \ldots, x_\ell, i)) - R(P, G(x_1, \ldots, x_\ell, i))| < \epsilon(m, \delta)$

then, with probability $\geq 1 - \delta$ over the choice of $S$,
$\forall x_1, \ldots, x_\ell \in S \;\; \forall i \in I$,

$|R(S, G(x_1, \ldots, x_\ell, i)) - R(P, G(x_1, \ldots, x_\ell, i))| < \epsilon\left(m - \ell, \frac{\delta}{|I| \, m^{\ell}}\right)$

One should note that the point of this lemma is the change of order of quantification.
While in the assumption one first fixes $x_1, \ldots, x_\ell, i$ and then randomly
picks the sample $S$, in the conclusion we wish to have a claim that allows us to
pick $S$ first and then guarantee that, no matter which $x_1, \ldots, x_\ell, i$ is chosen, the
$S$-cost of the clustering is close to its true $P$-cost. Since such a strong statement
is too much to hope for, we invoke the sample compression idea, and restrict the
choice of the $x_i$'s by requiring that they are members of the sample $S$.

Proof (Sketch). The proof follows the lines of the uniform convergence results
for sample compression bounds for classification learning. Given a sample $S$ of
size $m$, for every choice of $\ell$ indices, $i_1, \ldots, i_\ell \in \{1, \ldots, m\}$, and $i \in I$, we use the
bound of Corollary 3 to bound the difference between the empirical and true risk
of the clustering $G(x_{i_1}, \ldots, x_{i_\ell}, i)$. We then apply the union bound to 'uniformize'
over all possible such choices.

In fact, the one-sided inequality,

$R(P, G(x_1, \ldots, x_\ell, i)) \leq R(S, G(x_1, \ldots, x_\ell, i)) + \epsilon$

suffices for proving the sample-based approximation results of this paper.
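
To get a feel for the sample sizes involved, the two bounds of this section can be evaluated numerically. The sketch below assumes the standard two-sided Hoeffding bound $2e^{-2\epsilon^2 m}$ for a single clustering and the deviation term $\sqrt{(\ln|I| + \ell \ln m + \ln(1/\delta)) / (2(m - \ell))}$ of Theorem 4; the function names and the parameter values in the usage lines are illustrative assumptions.

```python
import math

def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Smallest integer m with 2 * exp(-2 * eps^2 * m) <= delta,
    i.e. m >= ln(2 / delta) / (2 * eps^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def uniform_deviation_bound(m: int, ell: int, index_set_size: int, delta: float) -> float:
    """sqrt((ln|I| + ell * ln m + ln(1/delta)) / (2 * (m - ell))).
    Valid only for m > ell, since ell sample points are 'used up' by the description."""
    if m <= ell:
        raise ValueError("the bound needs m > ell")
    numerator = math.log(index_set_size) + ell * math.log(m) + math.log(1.0 / delta)
    return math.sqrt(numerator / (2.0 * (m - ell)))

# Illustrative values: accuracy 0.1, confidence 0.05, k = 3 centers, no extra parameter.
print(hoeffding_sample_size(0.1, 0.05))
for m in (1_000, 10_000, 100_000):
    print(m, uniform_deviation_bound(m, ell=3, index_set_size=1, delta=0.05))
```

Neither quantity depends on the size of the domain, which is what makes input-size-independent running times possible.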



3 Sample Based Approximation Results for Clustering in 
the General Setting 

Next we apply the convergence results of the previous section to obtain guar- 
antees on the approximation quality of sample based clustering. Before we can 
do that, we have to address yet another component of our paradigm. The con- 
vergence results that we have so far suffice to show that the empirical risk of a 
description scheme clustering that is based on sample points is close to its true 
risk. However, there may be cases in which any such clustering fails to approxi- 
mate the optimal clustering of a given input sample. To guard against such cases, 
we introduce our third property of clustering description schemes, the coverage 
property. 

The Coverage property: We consider two versions of this property:

Multiplicative coverage: A description scheme is $\alpha$-m-covering for a clustering
problem $(X, \mathcal{T}, R)$ if for every $S \subseteq X$ s.t. $|S| \geq \ell$, there exist
$\{x_1, \ldots, x_\ell\} \subseteq S$ and $i \in I$ such that for every $T \in \mathcal{T}$,

$R(S, G(x_1, \ldots, x_\ell, i)) \leq \alpha R(S, T)$

Namely, an optimal clustering of $S$ can be $\alpha$-approximated by applying the
description scheme $G$ to an $\ell$-tuple of members of $S$.

Additive coverage: A description scheme is $\eta$-a-covering for a clustering problem
$(X, \mathcal{T}, R)$ if for every $S \subseteq X$ s.t. $|S| \geq \ell$, there exist $\{x_1, \ldots, x_\ell\} \subseteq S$ and
$i \in I$ such that for every $T \in \mathcal{T}$,

$R(S, G(x_1, \ldots, x_\ell, i)) \leq R(S, T) + \eta$

Namely, an optimal clustering of $S$ can be approximated to within (additive)
$\eta$ by applying the description scheme $G$ to an $\ell$-tuple of members of $S$.



We are now ready to prove our central result. We formulate it for the case 
of multiplicative covering schemes. However, it is straightforward to obtain an 
analogous result for additive coverage. 



Theorem 5. Let $(X, \mathcal{T}, R)$ be a clustering problem that has a local and complete
description scheme which is $\alpha$-m-covering, for some $\alpha \geq 1$. Then $(X, \mathcal{T}, R)$ is
$\alpha$-approximable from samples.



Proof. Let m = O 




. Let $T^* \in \mathcal{T}$ be a clustering of $X$ that minimizes
$R(P, T)$, and let $S \subseteq X$ be an i.i.d. $P$-random sample of size $m$.

Now, with probability $\geq 1 - \delta$, $S$ satisfies the following chain of inequalities:

– By Corollary 3,

$R(P, T^*) + \epsilon \geq R(S, T^*)$

– Let $Opt(S)$ be a clustering of $S$ that minimizes $R(S, T)$. Clearly,

$R(S, T^*) \geq R(S, Opt(S))$

– Since $G$ is $\alpha$-covering, for some $x_1, \ldots, x_\ell \in S$ and $i \in I$,

$R(S, Opt(S)) \geq \frac{1}{\alpha} R(S, G(x_1, \ldots, x_\ell, i))$

– By Theorem 4, for the above choice of $x_1, \ldots, x_\ell, i$,

$R(S, G(x_1, \ldots, x_\ell, i)) \geq R(P, G(x_1, \ldots, x_\ell, i)) - \epsilon$

It therefore follows that

$R(P, G(x_1, \ldots, x_\ell, i)) \leq \alpha(R(P, T^*) + \epsilon) + \epsilon$

□

Theorem 6. Let $(X, \mathcal{T}, R)$ be a clustering problem and let $G(x_1, \ldots, x_\ell, i)$ be a
local and complete description scheme which is $\eta$-a-covering, for some $\eta \in [0, 1]$.
Then for every probability distribution $P$ over $X$ and $m \gg \ell$, if a sample, $S$,
of size $m$ is generated i.i.d. by $P$, then with probability exceeding $1 - \delta$,

$\min\{R(P, G(x_1, \ldots, x_\ell, i)) : x_1, \ldots, x_\ell \in S, i \in I\} \leq \min\{R(P, T) : T \in \mathcal{T}\} + \eta + \sqrt{\frac{\ln(|I|) + \ell \ln m + \ln(1/\delta)}{2(m - \ell)}}$



The proof is similar to the proof of Theorem 5 above. 



4 K-Median Clustering and Vector Quantization

In this section we show how to apply our general results to the specific cases
of K-median clustering and vector quantization. We have already discussed the
natural clustering description schemes for these cases, and argued that they are
both complete and local. The only missing component is therefore the analysis
of the coverage properties of these description schemes.

We consider two cases:

Metric K-median problem, where $X$ can be any metric space.

Euclidean K-median, where $X$ is assumed to be a Euclidean space. This
is also the context for the vector quantization problem.

In the first case there is no extra structure on the underlying domain metric
space, whereas in the second we assume that it is a Euclidean space (it turns out
that the assumption that the domain is a Hilbert space suffices for our results).

For the case of general metric spaces, we let $G(x_1, \ldots, x_k)$ be the basic description
scheme that assigns each point $y$ to the $x_i$ closest to it. (So, in this case
we do not use the extra parameter $i$.)

It is well known (see, e.g., [5]) that for any sample $S$, the best clustering with
center points from $S$ is at most a factor of 2 away from the optimal clustering for
$S$ (when centers can be any points in the underlying metric space). We therefore
get that in that case $G$ is 2-m-covering.
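
The factor-2 statement can be checked by brute force on a small finite metric space. The sketch below is only an illustration under stated assumptions: the random planar points, the choice of `sample` as the first five of them, and the function names are all hypothetical. It compares the best K-median cost on the sample when centers are restricted to sample points against the best cost when any domain point may serve as a center.

```python
import itertools
import math
import random

def cost(points, centers):
    """Average Euclidean distance from each point to its closest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)

def best_cost(points, candidates, k):
    """Minimum cost over all k-subsets of the candidate centers (exhaustive search)."""
    return min(cost(points, c) for c in itertools.combinations(candidates, k))

rng = random.Random(1)
domain = [(rng.random(), rng.random()) for _ in range(9)]   # finite metric space
sample = domain[:5]                                         # the sample S
k = 2

restricted = best_cost(sample, sample, k)     # centers must be sample points
unrestricted = best_cost(sample, domain, k)   # centers may be any domain point
print(restricted, unrestricted, restricted <= 2 * unrestricted)
```

The inequality follows from the triangle inequality: replacing each optimal center by the closest sample point of its cluster at most doubles every point's distance to its center.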

For the case of a Euclidean, or Hilbert space, domain, we can also employ a
richer description scheme. For a parameter $t$, we wish to consider clustering
centers that are the centers of mass of $t$-tuples of sample points (rather than
just the sample points themselves). Fixing parameters $t$ and $r$, let our index set
$I$ be $\{1, \ldots, r\}^{tk}$, that is, the set of all vectors of length $k$ whose entries are $t$-tuples
of indices in $\{1, \ldots, r\}$. Let $G_t(x_1, \ldots, x_r, i) = G(x_{i_1}, \ldots, x_{i_{tk}})$, where $i \in I$
indexes a sequence $(x_{i_1}, \ldots, x_{i_{tk}})$ of points in $\{x_1, \ldots, x_r\}$, and $G(x_{i_1}, \ldots, x_{i_{tk}})$ is
the clustering defined by the set of centers $\left\{\frac{1}{t} \sum_{j=th+1}^{t(h+1)} x_{i_j} : h \in \{0, \ldots, k-1\}\right\}$.

That is, we take the 'centers of mass' of $t$-tuples of points of $S$, where $i$ is the
index of the sequence of $kt$ points that defines our centers. It is easy to see that
such $G_t$ is complete iff $r \geq k$.

The following lemma of Maurey [1] implies that, for $t \leq r$, this description
scheme enjoys an $\eta$-a-coverage, for $\eta = 1/\sqrt{t}$.

Theorem 7 (Maurey, [1]). Let $F$ be a vector space with a scalar product $(\cdot, \cdot)$
and let $\|f\| = \sqrt{(f, f)}$ be the induced norm on $F$. Suppose $G \subseteq F$ and that, for
some $c > 0$, $\|g\| \leq c$ for all $g \in G$. Then for all $f$ from the convex hull of $G$ and
all $k \geq 1$ the following holds:

$\inf_{g_1, \ldots, g_k \in G} \left\| \frac{1}{k} \sum_{i=1}^{k} g_i - f \right\| \leq \sqrt{\frac{c^2 - \|f\|^2}{k}}$

Corollary 8. Consider the K-median problem over a Hilbert space, $X$. For every
$t$ and $r \geq \max\{k, t\}$, the clustering algorithm that, on a sample $S$, outputs
$\mathrm{Argmin}\{R(G_t(x_1, \ldots, x_r, i)) : x_1, \ldots, x_r \in S, \text{ and } i \leq r^{tk}\}$ produces, with probability
exceeding $1 - \delta$, a clustering whose cost is no more than

$\frac{1}{\sqrt{t}} + \sqrt{\frac{k(t \ln r + \ln |S|) + \ln(1/\delta)}{2(|S| - r)}}$

above the cost of the optimal $k$-centers clustering of the sample generating distribution
(for any sample generating distribution and any $\delta > 0$).
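
The richer description scheme $G_t$ can be sketched in code as a map from $r$ sample points and an index vector to $k$ centers of mass. This is a minimal illustration, assuming that the index vector is split into $k$ consecutive blocks of $t$ indices and that each center is the mean of the points in its block; the function name and the 0-based indexing are choices made here, not the paper's.

```python
def centers_of_mass(points, index, k, t):
    """Return k centers; center h is the mean of the t points selected by
    index[h*t : (h+1)*t] (0-based indices into points, repetitions allowed)."""
    if len(index) != k * t:
        raise ValueError("index must contain exactly k * t entries")
    dim = len(points[0])
    centers = []
    for h in range(k):
        block = [points[j] for j in index[h * t:(h + 1) * t]]
        centers.append(tuple(sum(p[c] for p in block) / t for c in range(dim)))
    return centers

# r = 3 sample points in the plane, k = 2 centers, each the mean of t = 2 points.
pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
print(centers_of_mass(pts, index=(0, 1, 2, 2), k=2, t=2))
```

Repeating one index $t$ times inside a block reproduces a sample point itself as a center, so the plain scheme with sample points as centers is a special case.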



4.1 Implications to Worst Case Complexity 

As we mentioned earlier, worst case complexity models of clustering can be 
naturally viewed as a special case of the statistical clustering framework. The 
computational model in which there is access to random uniform sampling from 
a finite input set, can be viewed as a statistical clustering problem with P being 
the uniform distribution over that input set. 

Let $(X, d)$ be a metric space, $\mathcal{T}$ a set of legal clusterings of $X$ and $R$ an objective
function. A worst case sampling-based clustering algorithm for $(X, \mathcal{T}, R)$
is an algorithm that gets as input finite subsets $Y \subseteq X$, has access to uniform
random sampling over $Y$, and outputs a clustering of $Y$.

Corollary 9. Let $(X, \mathcal{T}, R)$ be a clustering problem. If, for some $\alpha \geq 1$, there
exists a clustering description scheme for $(X, \mathcal{T}, R)$ which is both complete and
$\alpha$-m-covering, then there exists a worst case sampling-based clustering algorithm
for $(X, \mathcal{T}, R)$ that runs in constant time depending only on the approximation and
confidence parameters, $\epsilon$ and $\delta$ (and independent of the input size $|Y|$), and outputs
an $\alpha Opt + \epsilon$ approximation of the optimal clustering for $Y$, with probability
exceeding $1 - \delta$.

Note that the output of such an algorithm is an implicit description of a
clustering of $Y$. It outputs the parameters from which the description scheme
determines the clustering. For natural description schemes (such as describing a Voronoi
diagram by listing its center points) the computation needed to figure out the
cluster membership of any given $y \in Y$ requires constant time.

Abstract. The hierarchical mixture of experts architecture provides a
flexible procedure for implementing classification algorithms. The classification
is obtained by a recursive soft partition of the feature space
in a data-driven fashion. Such a procedure enables local classification
where several experts are used, each of which is assigned the task
of classification over some subspace of the feature space. In this work,
we provide data-dependent generalization error bounds for this class of
models, which lead to effective procedures for performing model selection.
Tight bounds are particularly important here, because the model is
highly parameterized. The theoretical results are complemented with numerical
experiments based on a randomized algorithm, which mitigates
the effects of local minima which plague other approaches such as the
expectation-maximization algorithm.



1 Introduction 

The mixture of experts (MoE) and hierarchical mixture of experts (HMoE)
architectures, proposed in [10] and extensively studied in later work, are a
flexible approach to constructing complex classifiers. In contrast to many other
approaches, they are based on an adaptive soft partition of the feature space
into regions, to each of which is assigned a ‘simple’ (e.g. generalized linear
model (glim)) classifier.
This approach should be contrasted with more standard approaches which con- 
struct a complex parameterization of a classifier over the full space, and attempt 
to learn its parameters. 

In binary pattern classification one attempts to choose a soft classifier $f$ from
some class $\mathcal{F}$, in order to classify an observation $x \in \mathcal{X}$ into one of two classes
$y \in \mathcal{Y} = \{-1, +1\}$ using $\mathrm{sgn}(f(x))$. In the case of the 0-1 loss, the ideal classifier
minimizes the risk $P_e(f) = P\{\mathrm{sgn}(f(X)) \neq Y\} = P\{Y f(X) \le 0\}$. If $\mathrm{sgn}(\mathcal{F})$
consists of all possible mappings from $\mathcal{X}$ to $\mathcal{Y}$, then the ultimate best classifier
is the Bayes classifier $f_B(X) = \arg\max_{y \in \mathcal{Y}} P\{Y = y \mid X\}$. In practical situations,
the selection of a classifier is based on a sample $D_N = \{(X_n, Y_n) \in \mathcal{X} \times \mathcal{Y}\}_{n=1}^{N}$,
where each pair is assumed to be drawn i.i.d. from an unknown distribution
$P(X, Y)$.



J. Shawe-Taylor and Y. Singer (Eds.): COLT 2004, LNAI 3120, pp. 427—441, 2004. 
In this paper we consider the class of hierarchical mixtures of experts classifiers
[10], which is based on a soft adaptive partition of the input space, and a
utilization of a small number of ‘expert’ classifiers in each domain. Such a pro- 
cedure can be thought of, on the one hand, as extending standard approaches 
based on mixtures, and, on the other hand, providing a soft probabilistic ex- 
tension of decision trees. This architecture has been successfully applied to re- 
gression, classification, control and time series analysis. It should be noted that 
since the HMoE architecture is highly parameterized, it is important to obtain 
tight error bounds, in order to prevent overfitting. Previous results attempting 
to establish bounds on the estimation error of the MoE system were based on 
the VC dimension [9] and covering number approaches [15]. Unfortunately, such 
approaches are too weak to be useful in any practical setting. 



2 Preliminary Results 

Consider a soft classifier $f$, and the 0-1 loss incurred by it, given by $I[yf(x) \le 0]$,
where $I[t \le 0]$ is the indicator function of the event ‘$t \le 0$’. While we attempt
to minimize the expected value of the 0-1 loss, it turns out to be inopportune
to directly minimize functions based on this loss. First, the computational task
is often intractable due to its non-smoothness. Second, minimizing the empirical
0-1 loss may lead to severe overfitting. Many recent approaches are based on
minimizing a smooth convex function $\phi(yf(x))$ which upper bounds the 0-1
loss (e.g. [20,12,1]). Define the $\phi$-risk, $E_\phi(f) = E\{\phi(Y f(X))\}$, and denote the
empirical $\phi$-risk by $\hat{E}_\phi(f, D_N) = \frac{1}{N} \sum_{n=1}^{N} \phi(y_n f(x_n))$. We assume that the
loss function $\phi(t)$ satisfies $\phi(0) = 1$, $\phi(t)$ is Lipschitz with constant $L_\phi$, $\phi_{\max} < \infty$
where $\phi_{\max} = \sup_{t \in \mathbb{R}} \phi(t)$, and $I[t \le 0] \le \phi(t)$ for all $t$. Using the $\phi$-risk instead
of the risk itself is motivated by several reasons. (i) Minimizing the $\phi$-risk often
leads asymptotically to the Bayes decision rule [20]. (ii) Rather tight upper
bounds on the risk may be derived for finite sample sizes (e.g. [20,12,1]). (iii)
Minimizing the empirical $\phi$-risk instead of the empirical risk is computationally
much simpler.
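
As an illustration of the relation between the two losses, the following minimal sketch computes the empirical 0-1 risk and the empirical $\phi$-risk on a small sample. It uses $\phi(t) = 1 - \tanh(2t)$, the surrogate used in the simulations of Section 7; the classifier outputs and labels are hypothetical values chosen only for the example.

```python
import numpy as np

def phi(t):
    """Surrogate loss phi(t) = 1 - tanh(2t); phi(0) = 1 and phi(t) >= I[t <= 0]."""
    return 1.0 - np.tanh(2.0 * t)

def empirical_phi_risk(f_values, y):
    """Average of phi(y_n * f(x_n)) over the sample."""
    return float(np.mean(phi(y * f_values)))

def empirical_01_risk(f_values, y):
    """Fraction of sample points with y_n * f(x_n) <= 0."""
    return float(np.mean(y * f_values <= 0))

# Hypothetical soft-classifier outputs f(x_n) and labels y_n in {-1, +1}.
f_values = np.array([1.2, -0.3, 0.4, -2.0, 0.1])
y = np.array([1, 1, -1, -1, 1])

risk_phi = empirical_phi_risk(f_values, y)
risk_01 = empirical_01_risk(f_values, y)
```

Since $\phi$ dominates the indicator pointwise, `risk_phi` can never fall below `risk_01` on any sample.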

Data dependent error bounds are often derived using the Rademacher complexity.
Let $\mathcal{F}$ be a class of real-valued functions with domain $\mathcal{X}$. The empirical
Rademacher complexity is defined as

$$\hat{R}_N(\mathcal{F}) = E_\sigma \left\{ \sup_{f \in \mathcal{F}} \frac{1}{N} \sum_{n=1}^{N} \sigma_n f(x_n) \right\} ,$$

where $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_N)$ is a random vector consisting of independently
distributed binary random variables with $P(\sigma_n = 1) = P(\sigma_n = -1) = 1/2$. The
Rademacher complexity is defined as the average over all possible training
sequences, $R_N(\mathcal{F}) = E_{D_N} \hat{R}_N(\mathcal{F})$.
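
The empirical Rademacher complexity of a finite function class can be approximated by Monte Carlo sampling of the sign vector $\sigma$. The sketch below is only an illustration under that assumption (a finite class, given by its values on the sample); the function name and the number of draws are hypothetical choices.

```python
import numpy as np

def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_f (1/N) sum_n sigma_n f(x_n) ].

    values: array of shape (num_functions, N); row i holds f_i(x_1), ..., f_i(x_N).
    """
    rng = np.random.default_rng(seed)
    num_functions, n = values.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher signs
    correlations = sigma @ values.T / n                  # shape (n_draws, num_functions)
    # Supremum over the class for each draw, then average over the draws.
    return float(correlations.max(axis=1).mean())
```

Enlarging the class can only increase the estimate for a fixed set of sign draws, which mirrors the monotonicity of the complexity in the class.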

The following Theorem, adapted from [2] and [16], will serve as our starting 
point. 
Theorem 1. For every $\delta \in (0, 1)$ and positive integer $N$, with probability at
least $1 - \delta$ over training sequences of length $N$, every $f \in \mathcal{F}$ satisfies

$$P_e(f) \le \hat{E}_\phi(f, D_N) + 2 L_\phi \hat{R}_N(\mathcal{F}) + 3 \phi_{\max} \sqrt{\frac{\ln(2/\delta)}{2N}} .$$

This bound is proved in three steps. First, McDiarmid’s inequality [14] and a
symmetrization argument [19] are used to bound $\sup_{f \in \mathcal{F}} (E_\phi(f) - \hat{E}_\phi(f, D_N))$
with $R_N(\phi \circ \mathcal{F})$, which is then bounded by $\hat{R}_N(\phi \circ \mathcal{F})$ using McDiarmid’s
inequality again. The claim is established by using the Lipschitz property of $\phi(\cdot)$
to bound $\hat{R}_N(\phi \circ \mathcal{F})$ with $L_\phi \hat{R}_N(\mathcal{F})$ (e.g. [11,16]). In the sequel we upper bound
$\hat{R}_N(\mathcal{F})$ for the case where $\mathcal{F}$ is the HMoE classifier.

Remark 1. The results of the Theorem can be tightened using the entropy
method [4]. This leads to improved constants in the bounds, which are of
particular significance when the sample size is small. We defer discussion of this issue
to the full paper. 




3 Mixture of Experts Classifiers 

Consider initially the simple MoE architecture defined in Figure 1, and given 
mathematically by 



$$f(x) = \sum_{m=1}^{M} a_m(w_m, x)\, h_m(v_m, x) . \qquad (1)$$

We interpret the functions $h_m$ as experts, each of which ‘operates’ in regions
of space for which the gating functions $a_m$ are nonzero. Note that assuming
$a_m$ to be independent of $x$ leads to a standard mixture. Such a classifier can
be intuitively interpreted as implementing the principle of ‘divide and conquer’
where instead of solving one complicated problem (over the full space), we can
do better by dividing it into several regions, defined through the gating functions
$a_m$, and using a ‘simple’ expert $h_m$ in each region. It is clear that some
restriction needs to be imposed on the gating functions and experts, since otherwise
overfitting is imminent. We formalize the assumptions regarding the experts and
gating functions below. These assumptions will later be weakened.
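
A minimal sketch of the MoE output as a gated sum of experts is given below. The sigmoid gate and the tanh expert are illustrative assumptions (one possible half-space gate and one possible antisymmetric glim expert), not choices prescribed by the text; the function names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def moe_output(x, W, V):
    """f(x) = sum_m a_m(w_m^T x) * h_m(v_m^T x) for M experts.

    x: (k,) feature vector; W, V: (M, k) gate and expert parameters.
    The gate a_m = sigmoid and the expert h_m = tanh are illustrative choices.
    """
    gates = sigmoid(W @ x)    # a_m(w_m, x): weight of expert m at the point x
    experts = np.tanh(V @ x)  # h_m(v_m, x): antisymmetric in v_m
    return float(gates @ experts)
```

The predicted label is then the sign of `moe_output(x, W, V)`. With all gate parameters set to zero the gates are constant in $x$, which corresponds to the standard mixture mentioned above.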

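Definition 1 (Experts). For each $1 \le m \le M$, let $V_{\max}^m$ be some nonnegative
scalar and $v_m$ a vector with $k$ elements. Then, the $m$-th expert is given
by a mapping $h_m(v_m, x)$ where $v_m \in \mathcal{V}_m = \{v \in \mathbb{R}^k : \|v\| \le V_{\max}^m\}$. We
define the collection of all functions $h_m(v_m, x)$ such that $v_m \in \mathcal{V}_m$ as $\mathcal{H}_m$. To
simplify the notation we define $V_{\max} = \sup_m V_{\max}^m$ and set $\mathcal{H} = \bigcup_{m=1}^{M} \mathcal{H}_m =
\bigcup_{m=1}^{M} \{h_m(v_m, x),\ v_m \in \mathcal{V}_m\}$.

In the definitions below we write $\sup_{w,v}$ instead of $\sup_{w_m \in \mathcal{W}_m, v_m \in \mathcal{V}_m}$.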
Fig. 1. MoE classifier with M experts.



Assumption 1. The following assumptions, serving the purpose of regularization,
are made for each $m$, $1 \le m \le M$. (i) To allow different types of experts,
assume $h_m(v_m, x) = h_m(\tau_m(v_m, x))$ where $\tau_m(v_m, x)$ is some mapping such as
$v_m^\top x$ or $\|v_m - x\|$. We assume that $h_m(\tau_m(v_m, x))$ is Lipschitz with constant
$L_{h_m}$, i.e. $|h_m(\tau) - h_m(\tau')| \le L_{h_m} |\tau - \tau'|$ for all $\tau, \tau'$ in the range of $\tau_m$.
(ii) $|h_m(v_m, x)|$ is bounded by some positive constant $M_{\mathcal{H}_m} < \infty$. So, by defining
$M_{\mathcal{H}} = \max_m M_{\mathcal{H}_m}$ we have that $\sup_{m, v_m, x} |h_m(v_m, x)| \le M_{\mathcal{H}}$. (iii) The
experts are either symmetric (for regression) or antisymmetric (for classification)
with respect to the parameters so that $h_m(v_m, x) = \nu h_m(-v_m, x)$ for some
$\nu \in \{\pm 1\}$.

Remark 2. Throughout our analysis we refer to $x$ as a sample of the feature
space. Yet, our results can be immediately extended to experts of the form
$h_m(v_m, x) = h_m(v_m^\top \Phi_m(x))$ where $\Phi_m(x)$ may be a high-dimensional nonlinear
mapping as is used in kernel methods. Since our results are independent of the
dimension of $\Phi_m(x)$, they can be used to obtain useful bounds for local mixtures
of kernel classifiers. The use of such experts results in a powerful classifier that
may select a different kernel in each region of the feature space.

The gating functions $a_m(\cdot, x)$ reflect the relative weights of each of the experts
at a given point x. In the sequel we consider two main types of gating functions. 



Definition 2 (Gating functions). For each $1 \le m \le M$, let $W_{\max}^m$ be a
nonnegative scalar and $w_m$ a vector with $k$ elements. Then, the $m$-th gating
function is given by a mapping $a_m(w_m, x)$ where $w_m \in \mathcal{W}_m = \{w \in \mathbb{R}^k :
\|w\| \le W_{\max}^m\}$. To simplify the notation we define $W_{\max} = \sup_m W_{\max}^m$ and set
$\mathcal{A} = \bigcup_{m=1}^{M} \mathcal{A}_m = \bigcup_{m=1}^{M} \{a_m(w_m, x) \mid w_m \in \mathcal{W}_m\}$. If $a_m(w_m, x) = a_m(w_m^\top x)$ we
say that $a_m(w_m, x)$ is a half-space gate, and if $a_m(w_m, x) = a_m(\|w_m - x\|^2/2)$
we say that $a_m(w_m, x)$ is a local gate.
Assumption 2. The following assumptions are made for every $m$, $1 \le m \le M$.
(i) $a_m(w_m, x)$ is Lipschitz with constant $L_{a_m}$, analogously to Assumption 1. We
define $L_a = \max_m L_{a_m}$. (ii) $|a_m(w_m, x)|$ is bounded by some positive constant
$M_{\mathcal{A}_m} < \infty$. So, by defining $M_{\mathcal{A}} = \max_m M_{\mathcal{A}_m}$ we have $\sup_{m, w_m, x} |a_m(w_m, x)| \le M_{\mathcal{A}}$.

In Section 6 we will remove some of the restrictions imposed on the param- 
eters. 

4 Risk Bounds for Mixture of Experts Classifiers 

In this section we address the problem of bounding $\hat{R}_N(\mathcal{F})$ where $\mathcal{F}$ is the class
of all MoE classifiers defined in Section 3. We begin with the following Lemma,
the proof of which can be found in the appendix.

Lemma 1. Let $\mathcal{F}_m = \{a_m(w_m, x) h_m(v_m, x) \mid a_m(w_m, x) \in \mathcal{A}_m,\ h_m(v_m, x) \in \mathcal{H}_m\}$.
Then, $\hat{R}_N(\mathcal{F}) = \sum_{m=1}^{M} \hat{R}_N(\mathcal{F}_m)$.

Thus, it suffices to bound $\hat{R}_N(\mathcal{F}_m)$, $m = 1, 2, \ldots, M$, in order to establish
bounds for $\hat{R}_N(\mathcal{F})$. To do so, we use the following Lemma.

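Lemma 2. Let $\mathcal{G}_1, \mathcal{G}_2$ be two classes defined over some sets $\mathcal{X}_1, \mathcal{X}_2$ respectively,
and define the class $\mathcal{G}_3$ as

$$\mathcal{G}_3 = \{g : g(x_1, x_2) = g_1(x_1) g_2(x_2),\ g_1 \in \mathcal{G}_1,\ g_2 \in \mathcal{G}_2\} .$$

Assume further that at least one of the sets $\mathcal{X}_1$ or $\mathcal{X}_2$ is closed under negation
and that every function in the class defined over this set is either symmetric or
antisymmetric. Then,

$$Z(\mathcal{G}_3) \le M_2 Z(\mathcal{G}_1) + M_1 Z(\mathcal{G}_2) ,$$

where $Z(\mathcal{G}_i) = E_\sigma \left\{ \sup_{g \in \mathcal{G}_i} \sum_{n=1}^{N} \sigma_n g(x_n) \right\}$ and $M_i = \sup_{g_i \in \mathcal{G}_i, x_i \in \mathcal{X}_i} |g_i(x_i)|$
for $i = 1, 2$.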

The proof of Lemma 2 is given in the Appendix. Note that a simpler derivation
is possible using the identity $ab = \frac{1}{4}\left((a + b)^2 - (a - b)^2\right)$. However, this
approach leads to a looser bound. This lemma implies the following corollary.

Corollary 1. For every $1 \le m \le M$ define $\mathcal{F}_m$ as in Lemma 1. Then,

$$\hat{R}_N(\mathcal{F}_m) \le M_{\mathcal{H}_m} \hat{R}_N(\mathcal{A}_m) + M_{\mathcal{A}_m} \hat{R}_N(\mathcal{H}_m) \qquad (m = 1, 2, \ldots, M) .$$

We emphasize that Corollary 1 is tight. To see that, set the gating function
to be a constant. In this case $\hat{R}_N(\mathcal{A}_m) = 0$ and an equality is obtained by setting
the gating variable to $M_{\mathcal{A}_m}$. In the sequel we use the following basic result (see
[11,16] for a proof).







A. Azran and R. Meir 



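Lemma 3. Assume $\psi$ is Lipschitz with constant $L_\psi$ and let $g$ be
some given real-valued function of the pair $(y, f(x))$. Then, for every integer $N$

$$E_\sigma \left\{ \sup_{f \in \mathcal{F}} \sum_{n=1}^{N} \sigma_n \psi\left(g(y_n, f(x_n))\right) \right\} \le L_\psi E_\sigma \left\{ \sup_{f \in \mathcal{F}} \sum_{n=1}^{N} \sigma_n g(y_n, f(x_n)) \right\} .$$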





Remark 3. To minimize the technical burden, we assume the experts are generalized
linear models (glim, see [13]), i.e. $\tau_m(v_m, x) = \tau_m(v_m^\top x)$ in Assumption
1. An extension to generalized radial basis functions (grbf), i.e. $\tau_m(v_m, x) =
\tau_m(\|v_m - x\|)$, is immediate using our analysis of local gating functions. Extensions
to many other types can be achieved using a similar technique.



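Using the Lipschitz property of the class $\mathcal{H}_m$ along with Lemma 3 we get

$$\hat{R}_N(\mathcal{H}_m) \le \frac{L_{h_m}}{N} E_\sigma \left\{ \sup_{v_m \in \mathcal{V}_m} \sum_{n=1}^{N} \sigma_n v_m^\top x_n \right\} .$$

By the Cauchy-Schwarz and Jensen inequalities we find that

$$\hat{R}_N(\mathcal{H}_m) \le \frac{L_{h_m}}{N} E_\sigma \left\{ \sup_{v_m \in \mathcal{V}_m} \|v_m\| \left\| \sum_{n=1}^{N} \sigma_n x_n \right\| \right\} \le \frac{L_{h_m} V_{\max}^m}{N} E_\sigma \left\| \sum_{n=1}^{N} \sigma_n x_n \right\| \le \frac{L_{h_m} V_{\max}^m}{\sqrt{N}}\, \bar{x} ,$$

where $\bar{x} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \|x_n\|^2}$.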
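
For a linear class $\{x \mapsto v^\top x : \|v\| \le V\}$ the supremum over $v$ has a closed form, so the Cauchy-Schwarz and Jensen steps can be checked numerically. The sketch below assumes an identity link (Lipschitz constant 1) and a small hypothetical sample, so that the expectation over $\sigma$ can be computed exactly by enumeration; the function names are illustrative.

```python
import itertools
import numpy as np

def linear_class_rademacher(X, v_max):
    """Exact empirical Rademacher complexity of {x -> v^T x : ||v|| <= v_max}.

    The supremum over v is attained at v parallel to sum_n sigma_n x_n, so
    sup_v (1/N) sum_n sigma_n v^T x_n = (v_max / N) * ||sum_n sigma_n x_n||.
    The expectation over sigma is computed by enumerating all 2^N sign vectors.
    """
    n = X.shape[0]
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))  # (2^N, N)
    norms = np.linalg.norm(signs @ X, axis=1)
    return float(v_max * norms.mean() / n)

def jensen_bound(X, v_max):
    """Upper bound v_max * x_bar / sqrt(N), with x_bar = sqrt(mean_n ||x_n||^2)."""
    n = X.shape[0]
    x_bar = np.sqrt(np.mean(np.sum(X ** 2, axis=1)))
    return float(v_max * x_bar / np.sqrt(n))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # hypothetical sample: N = 10 points in R^3
exact = linear_class_rademacher(X, v_max=2.0)
bound = jensen_bound(X, v_max=2.0)
```

Here `exact` never exceeds `bound`, and the two coincide when the sample consists of a single point.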

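For the case of half-space gating functions we have $a(w, x) = a(w^\top x)$.
In this case, an argument analogous to the one used for the experts can
be used to bound $\hat{R}_N(\mathcal{A}_m)$. For the case of local gating functions we have
$a(w, x) = a(\|w - x\|^2/2)$. Similar arguments lead to the bound

$$\hat{R}_N(\mathcal{A}_m) \le \frac{L_{a_m}}{\sqrt{N}} \left[ \frac{(W_{\max}^m)^2}{2} + W_{\max}^m \bar{x} \right] ,$$

with $\bar{x}$ as defined above.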

We summarize our results in the following Theorem. 



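Theorem 2. Let $\mathcal{F}$ be the class of mixture of experts classifiers with $M$ glim
experts. Assume that gates $1, 2, \ldots, M_1$ are local and that gates $M_1 + 1, \ldots, M$
are half-space, where $0 \le M_1 \le M$. Then,

$$\hat{R}_N(\mathcal{F}) \le \frac{1}{\sqrt{N}} \left[ \sum_{m=1}^{M_1} c_{1,m} (W_{\max}^m)^2 + \sum_{m=1}^{M} c_{2,m} W_{\max}^m + \sum_{m=1}^{M} c_{3,m} V_{\max}^m \right] ,$$

where $c_{1,m} = M_{\mathcal{H}_m} L_{a_m}/2$, $c_{2,m} = M_{\mathcal{H}_m} L_{a_m} \bar{x}$ and $c_{3,m} = M_{\mathcal{A}_m} L_{h_m} \bar{x}$ for all
$m = 1, 2, \ldots, M$.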
5 Hierarchical Mixture of Experts

The MoE classifier is defined by a linear combination of $M$ experts. An intuitive
interpretation of this combination is the division of the feature
space into subspaces, in each of which the experts are combined using the weights
$a_m$. The Hierarchical MoE takes this procedure one step further by recursively
dividing the subspaces using a MoE classifier as the expert in each domain, as
described in Figure 2.




Fig. 2. Balanced 2-level HMoE classifier with M experts. Each expert in the first level 
is a mixture of M sub-experts. 



In this section we extend the bound obtained for the MoE to the case of
HMoE. We demonstrate the procedure for the case of a balanced two-level
hierarchy with $M$ experts (see Figure 2). It is easy to repeat the same procedure
for any number of levels, whether the HMoE is balanced or not, using the same
idea.
We begin by giving the mathematical description of the HMoE classifier. Let
$f(x)$ be the output of the HMoE, and let $g_m(\theta_m, x)$ be the output of the $m$-th
expert, $1 \le m \le M$. The parameter $\theta_m$ is comprised of all the parameters of the
$m$-th first level expert, as will be detailed shortly. This is described by

$$f(x) = \sum_{m=1}^{M} a_m(w_m, x)\, g_m(\theta_m, x) ,$$

where $a_m(w_m, x)$ is the weight of the $m$-th expert in the first level, $g_m(\theta_m, x)$,
given by

$$g_m(\theta_m, x) = \sum_{j=1}^{M} a_{mj}(w_{mj}, x)\, h_{mj}(v_{mj}, x) ,$$

where $a_{mj}(w_{mj}, x)$ is the weight of $h_{mj}(v_{mj}, x)$, the $j$-th (sub-)expert in the
$m$-th expert of the first level. By defining $\theta_{mj} = [w_{mj}, v_{mj}]$, we have that $\theta_m =
[\theta_{m1}, \ldots, \theta_{mM}]$. We also define $w = [w_1, \ldots, w_M]$, the parameter vector of the
gates of the first level and $\theta = [w, \theta_1, \ldots, \theta_M]$, the parameter vector of the
HMoE.
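
The two-level recursion can be sketched in a few lines. As in any such illustration, the sigmoid gates and tanh experts are assumptions made for concreteness, and the array layout (one block of $M$ sub-gates and $M$ sub-experts per first-level expert) and function names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hmoe_output(x, W_top, W_sub, V_sub):
    """Balanced 2-level HMoE output f(x) = sum_m a_m(w_m, x) g_m(theta_m, x).

    x: (k,) feature vector; W_top: (M, k) first-level gate parameters w_m;
    W_sub, V_sub: (M, M, k) second-level gate and expert parameters w_mj, v_mj.
    """
    top_gates = sigmoid(W_top @ x)               # a_m(w_m, x), shape (M,)
    sub_gates = sigmoid(W_sub @ x)               # a_mj(w_mj, x), shape (M, M)
    sub_experts = np.tanh(V_sub @ x)             # h_mj(v_mj, x), shape (M, M)
    g = np.sum(sub_gates * sub_experts, axis=1)  # g_m(theta_m, x), shape (M,)
    return float(top_gates @ g)
```

Each first-level expert $g_m$ is itself an MoE, so $|g_m|$ is at most $M$ times the product of the gate and expert bounds, which is the quantity that enters the recursive use of Corollary 1.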

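Recall that we are seeking to bound the Rademacher complexity for the case
of HMoE. First, we use the independence of the first level gating functions to
show that

$$\hat{R}_N(\mathcal{F}) = \sum_{m=1}^{M} E_\sigma \left\{ \sup_{w_m, \theta_m} \frac{1}{N} \sum_{n=1}^{N} \sigma_n a_m(w_m, x_n)\, g_m(\theta_m, x_n) \right\} . \qquad (2)$$

So, our problem boils down to bounding the summands in (2). Notice that for
every $m = 1, \ldots, M$ we have $\sup_{\theta_m, x} |g_m(\theta_m, x)| \le M M_{\mathcal{H}} M_{\mathcal{A}}$. By defining $\mathcal{F}_m$
for the case of the 2-level HMoE analogously to the definition given in Lemma
1 for MoE, and using Corollary 1 recursively twice, it is easy to show that

$$\begin{aligned}
\hat{R}_N(\mathcal{F}_m) &\le M M_{\mathcal{H}} M_{\mathcal{A}} \hat{R}_N(\mathcal{A}) + M_{\mathcal{A}} E_\sigma \left\{ \sup \frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{M} \sigma_n a_{mj}(w_{mj}, x_n)\, h_{mj}(v_{mj}, x_n) \right\} \\
&= M M_{\mathcal{H}} M_{\mathcal{A}} \hat{R}_N(\mathcal{A}) + M_{\mathcal{A}} \sum_{j=1}^{M} E_\sigma \left\{ \sup \frac{1}{N} \sum_{n=1}^{N} \sigma_n a_{mj}(w_{mj}, x_n)\, h_{mj}(v_{mj}, x_n) \right\} \\
&\le M M_{\mathcal{H}} M_{\mathcal{A}} \hat{R}_N(\mathcal{A}) + M M_{\mathcal{A}} \left( M_{\mathcal{H}} \hat{R}_N(\mathcal{A}) + M_{\mathcal{A}} \hat{R}_N(\mathcal{H}) \right) \\
&= M M_{\mathcal{A}} \left[ 2 M_{\mathcal{H}} \hat{R}_N(\mathcal{A}) + M_{\mathcal{A}} \hat{R}_N(\mathcal{H}) \right] ,
\end{aligned}$$

which, combined with Corollary 1 implies Theorem 3.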



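Theorem 3. Let $\mathcal{F}$ be the class of balanced 2-level hierarchical mixture of experts
classifiers with $M$ experts in each division (see Figure 2). Then,

$$\hat{R}_N(\mathcal{F}) \le M^2 M_{\mathcal{A}} \left[ 2 M_{\mathcal{H}} \hat{R}_N(\mathcal{A}) + M_{\mathcal{A}} \hat{R}_N(\mathcal{H}) \right] .$$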



Notice that by choosing the constants more carefully, similar to Theorem 2, the 
bound in Theorem 3 can be tightened. 




Data Dependent Risk Bounds for Hierarchical Mixture of Experts Classifiers 






6 Fully Data Dependent Bounds 



So far, the feasible set for the parameters was determined by a ball with a predefined
radius ($W_{\max}$ for the gates or $V_{\max}$ for the experts). This predefinition
is problematic as it is difficult to know in advance how to set these parameters. 
Notice that given the number of experts M, these predefined parameters are the 
only elements in the bound that do not depend on the training sequence. In this 
section we eliminate the dependence on these preset parameters. Even though 
we give bounds for the case of MoE, the same technique can be easily harnessed 
to derive fully data dependent bounds for the case of HMoE. 

The derivation is based on the technique used in [6]. The basic idea is to
consider a grid of possible values for $W_{\max}^m$ and $V_{\max}^m$, for each of which Theorem
2 holds. Next, we assign a probability to each of these grid points and use a
variant of the union bound to establish a bound that holds for every possible
parameter.

Similarly to the definition of $\theta$ in Section 5, we define for the MoE classifier
$\theta = [\theta_1, \theta_2, \ldots, \theta_{2M}]$ where $\theta_m = w_m$ for all $m = 1, 2, \ldots, M$ and $\theta_m = v_{m-M}$ for all
$m = M+1, M+2, \ldots, 2M$. The following result provides a data dependent risk
bound with no preset system parameters, and can be proved using the methods
described in [16].



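Theorem 4. Let the definitions and notation of Theorem 2 hold. Let $g_0$ be some
positive number, and assume $\|\theta_m\| \ge g_0$ for every $m = 1, \ldots, 2M$. Then, with
probability at least $1 - \delta$ over training sequences of length $N$, every function
$f \in \mathcal{F}$ satisfies

$$\begin{aligned}
P_e(f) \le{} & \hat{E}_\phi(f, D_N) + \frac{4 L_\phi}{\sqrt{N}} \left[ 2 \sum_{m=1}^{M_1} c_{1,m} \|\theta_m\|^2 + \sum_{m=1}^{M} c_{2,m} \|\theta_m\| + \sum_{m=M+1}^{2M} c_{3,m-M} \|\theta_m\| \right] \\
& + 3 \phi_{\max} \sqrt{\frac{\ln\frac{2}{\delta} + 2 \sum_{m=1}^{2M} \ln \log_2 \frac{2\|\theta_m\|}{g_0}}{2N}} .
\end{aligned}$$

Remark 4. Theorem 4 can be generalized to hold for all $\theta$ (without the restriction
$\|\theta_m\| \ge g_0$), by using the proof method in [6], [16].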

7 Algorithm and Numerical Results 

We demonstrate how the bound derived in Section 4 can be used to select the 
number of experts in the MoE model. We consider algorithms which attempt 
to minimize the empirical $\phi$-loss $\hat{E}_\phi(f, D_N)$. It should be noted that previous
methods for estimating the parameters of the MoE model were based on gra- 
dient methods for maximizing the likelihood or minimizing some risk function. 
Such approaches are prone to problems of local optima, which render standard 
gradient descent approaches of limited use. This problem also occurs for the EM 
algorithm discussed in [10]. Notice that even if $\phi(yf(x))$ is convex with respect
to $yf(x)$, this doesn’t necessarily imply that it is convex with respect to the
parameters of $f(x)$. The deterministic annealing EM algorithm proposed in [17]
attempts to address the local maxima problem, using a modified posterior
distribution parameterized by a temperature-like parameter. A modification of the
EM algorithm, the split-and-merge EM algorithm proposed in [7], deals with 
certain types of local maxima involving an unbalanced usage of the experts over 
the feature space. 

One possible solution to the problem of identifying the location of the global 
minimum of the loss is given by the Cross-Entropy algorithm (see [5] for a recent 
review, [18]). This algorithm, similarly to genetic algorithms, is based on the idea 
of randomly drawing samples from the parameter space and improving the way 
these samples are drawn from generation to generation. We observe that the 
algorithm below is applicable to finite dimensional problems. 

To give an exact description of the algorithm used in our simulation we first
introduce the following notation. We let the definition of $\theta$ from Section 6 hold
and denote by $\Theta$ the feasible set of values for $\theta$. We also define a parameterized
p.d.f. $\psi_\theta(\theta; \zeta)$ over $\Theta$ with $\zeta$ parameterizing the distribution.

To find a point that is likely to be in the neighborhood of the global minimum,
we carry out Algorithm 1 (see box). Upon convergence, we use gradient methods
with $\theta_s^{B}$ (see box for definition) as the initial point to gain further accuracy in
estimating the global minimum point. We denote by $\hat{\theta}$ the solution of the
gradient minimization procedure and declare it as the final solution.

Simulation setup. We simulate a source generating data from a MoE classifier
with 3 experts. The Bayes risk for this problem is 18.33%. We used a training
sequence of length 300, for which we carried out Algorithm 1 followed by gradient
search with respect to $\hat{E}_\phi(f, D_N)$ where $\phi(t) = 1 - \tanh(2t)$. Denoting
by $\hat{f}_M$ the classifier that was selected for each $M = 1, 2, \ldots, 5$, we denote by
$\hat{E}_\phi(\hat{f}_M, D_N)$ the minimal empirical $\phi$-risk obtained over the class. We evaluate
the performance of each classifier by computing $P_e(\hat{f}_M, D_{test})$ over a test
sequence of 10® elements ($D_{test}$), drawn from the same source as the training
sequence. This is the reported probability of error $P_e(f)$. Figure 3 describes
these two measures computed over 400 different training sequences (the bars
describe the standard deviation). The graph labelled as the ‘complexity term’
in Figure 3 is the sum of all terms on the right hand side of Theorem 2 with
$\delta$ = 10“®, excluding $\hat{E}_\phi(\hat{f}_M, D_N)$. As for the CE parameters, we set $\psi_\theta(\cdot)$ to be
the $\beta$ distribution, $\zeta_0 = [1, 1]$ (corresponds to uniform distribution), $\rho_1 = 0.03$,
$\rho_2 = 0.001$, $\rho_3 = 0.7$ and $T = 200$. The results are summarized in Figure 3.

A few observations are in order: (i) As one might expect, $\hat{E}_\phi(\hat{f}_M, D_N)$ is
monotonically decreasing with respect to $M$. (ii) As expected, the complexity
term is monotonically increasing with respect to $M$ and (iii) $P_e(f)$ is the closest
to the Bayes error (18.33%) when $M = 3$, which is the Bayes solution. We witness
the phenomenon of underfitting for $M = 1, 2$ and overfitting for $M = 4, 5$, as
predicted by the bound.
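
The model selection rule suggested by these observations can be sketched as follows: for each candidate $M$, add the empirical $\phi$-risk to the complexity term and select the $M$ with the smallest sum. The function name and the dictionary inputs are illustrative assumptions; no numbers from the experiments are reproduced here.

```python
def select_num_experts(empirical_risk, complexity_term):
    """Return the M minimizing empirical risk + complexity term, and all bounds.

    Both arguments map a candidate number of experts M to a float.
    """
    bounds = {m: empirical_risk[m] + complexity_term[m] for m in empirical_risk}
    best_m = min(bounds, key=bounds.get)
    return best_m, bounds
```

Because the first term decreases with $M$ while the second increases, the minimizer trades off underfitting against overfitting.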
The Cross-Entropy Algorithm.

Input: $\psi_\theta(\cdot)$ and $\hat{E}_\phi(f(\theta), D_N)$.

Output: $\theta_s^{B}$, a point in the neighborhood of the global minimum of $\hat{E}_\phi(f(\theta), D_N)$.

Algorithm:

1. Pick some $\zeta_0$ (a good selection will turn $\psi_\theta(\theta; \zeta_0)$ into a uniform distribution over
$\Theta$). Set iteration counter $s = 1$, two positive integers $d$, $T$ and three parameters
$0 < \rho_1, \rho_2, \rho_3 < 1$.

2. Generate an ensemble $\{\theta_l\}_{l=1}^{L}$ where $L = 2kMT$ ($k$ is the dimension of the
feature space and $M$ is the number of experts, thus the dimension of $\theta$ is $2kM$),
drawn i.i.d. according to $\psi_\theta(\theta; \zeta_{s-1})$.

3. Calculate $\hat{E}_\phi(f(\theta_l), D_N)$ for each member of the ensemble. The Elite Sample (ES)
comprises the $\lfloor \rho_1 L \rfloor$ parameters that received the lowest empirical $\phi$-risk. Denote
the parameters that are associated with the worst and the best $\hat{E}_\phi(f, D_N)$ in the
ES as $\theta_s^{W}$ and $\theta_s^{B}$ respectively.

4. If for some $s > d$

$$\max_{s-d \le i, j \le s} \|\theta_i^{W} - \theta_j^{W}\| \le \rho_2 ,$$

stop (declare $\theta_s^{B}$ as the solution). Otherwise, solve the maximum likelihood
estimation problem, based on the ES, to estimate the parameters of $\psi_\theta$ (notice that
it is not an MLE for the original empirical risk minimization problem). Denoting
the solution as $\zeta_{ML}$, compute $\zeta_s = (1 - \rho_3)\zeta_{s-1} + \rho_3 \zeta_{ML}$. Set $s = s + 1$ and return
to 2.

Algorithm 1: The Cross-Entropy Algorithm for estimating the location of the global
minimum of the empirical $\phi$-risk.
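
The following sketch illustrates the cross-entropy search on the unit cube with independent Beta distributions per coordinate. It departs from the boxed description in several labelled ways: the Beta parameters are fitted by moment matching rather than maximum likelihood, the stopping test compares the worst elite scores of the last `d` iterations, and the ensemble size is passed directly. All names and default values are hypothetical.

```python
import numpy as np

def fit_beta_moments(samples):
    """Moment-matching estimate of Beta(alpha, beta) per coordinate.

    Stand-in for the maximum likelihood step; samples has shape (n, dim).
    """
    mean = samples.mean(axis=0)
    var = samples.var(axis=0) + 1e-12  # guard against division by zero
    common = np.maximum(mean * (1.0 - mean) / var - 1.0, 1e-3)
    return np.stack([mean * common, (1.0 - mean) * common])  # shape (2, dim)

def cross_entropy_minimize(objective, dim, ensemble_size=200, rho1=0.1,
                           rho2=1e-3, rho3=0.7, d=5, max_iter=200, seed=0):
    """Cross-entropy search for a minimizer of `objective` over [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    zeta = np.ones((2, dim))  # Beta(1, 1) is the uniform distribution
    n_elite = max(2, int(np.floor(rho1 * ensemble_size)))
    worst_scores = []
    best_theta = None
    for _ in range(max_iter):
        thetas = rng.beta(zeta[0], zeta[1], size=(ensemble_size, dim))
        scores = np.array([objective(t) for t in thetas])
        elite_idx = np.argsort(scores)[:n_elite]  # lowest scores first
        best_theta = thetas[elite_idx[0]]
        worst_scores.append(scores[elite_idx[-1]])
        recent = worst_scores[-d:]
        if len(worst_scores) >= d and max(recent) - min(recent) <= rho2:
            break  # the elite threshold has stalled
        # Smoothed update of the sampling distribution parameters.
        zeta = (1.0 - rho3) * zeta + rho3 * fit_beta_moments(thetas[elite_idx])
    return best_theta
```

In practice the returned point would serve as the starting point of a gradient search, as described in the text; parameters living in a ball rather than the unit cube would first be rescaled.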



We also applied a variant of Algorithm 1, suitable for an unbounded feasible
parameter set (the details will be discussed in the full paper), to the real-world
data sets bupa and pima [3]. We considered a MoE classifier with 1 to 4 linear
experts, all with local gates. The results are compared with those of linear-SVM
and RBF-SVM in Table 1.

8 Discussion 

We have considered the hierarchical mixture of experts architecture, and have 
established data dependent risk bounds for its performance. This class of ar- 
chitectures is very flexible and overly parameterized, and it is thus essential to 
establish bounds which do not depend on the number of parameters. Our bounds 
lead to very reasonable results on a toy problem. Also, the simulation results on 
real world problems are encouraging and motivate further research. Since the 
algorithmic issues are rather complicated for this architecture, it may be advan- 
tageous to consider some of the variational approaches proposed in recent years 
(e.g. [8]). We observe that the HMoE architecture can be viewed as a member of
the large class of widely used graphical models (a.k.a. Bayesian networks). We
expect that the techniques developed can be used to obtain tight risk bounds
for these architectures as well.

Table 1. Real world data sets results. The results were computed using 7-fold
cross-validation for BUPA and 10-fold cross-validation for PIMA. For each fold,
the parameters of the classifiers were selected using cross-validation in the
training sequence.

Data set    MoE (2 experts)    Linear-SVM       RBF-SVM
BUPA        0.289 ± 0.050      0.320 ± 0.084    0.317 ± 0.048
PIMA        0.241 ± 0.056      0.244 ± 0.050    0.255 ± 0.067



[Figure 3: plots omitted; one panel is titled ‘Data dependent bound’ and the horizontal axes show $M$.]

Fig. 3. A comparison between the data dependent bound of Theorem 2 and the true
error, computed over 400 Monte Carlo iterations of different training sequences. The
solid line describes the mean and the bars indicate the standard deviation over all
training sequences. The two figures on the left demonstrate the applicability of the
data dependent bound to the problem of model selection when one wishes to set the
optimal number of experts. It can be observed that the optimal predicted value for $M$
in this case is 3, which is the number of experts used to generate the data.
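A Proofs of Some of the Theorems

Proof of Lemma 1. To simplify the notation, we write $\sup_{w,v}$ instead of
$\sup_{w \in \mathcal{W}, v \in \mathcal{V}}$. Since, by definition, the set of parameters $(w_i, v_i)$ is independent
of $(w_j, v_j)$ for any $1 \le i, j \le M$, $i \ne j$, we have

$$\begin{aligned}
\hat{R}_N(\mathcal{F}) &= E_\sigma \left\{ \sup_{w,v} \frac{1}{N} \sum_{n=1}^{N} \sigma_n \sum_{m=1}^{M} a_m(w_m, x_n)\, h_m(v_m, x_n) \right\} \\
&= \sum_{m=1}^{M} E_\sigma \left\{ \sup_{w_m, v_m} \frac{1}{N} \sum_{n=1}^{N} \sigma_n a_m(w_m, x_n)\, h_m(v_m, x_n) \right\} = \sum_{m=1}^{M} \hat{R}_N(\mathcal{F}_m) .
\end{aligned}$$

□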

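Proof of Lemma 2. First, we introduce the following Lemma.

Lemma 4. For any function $C(g_1, g_2, x)$, there exists $\nu \in \{\pm 1\}$ such that

$$E_\sigma \left\{ \sup_{g_1, g_2} \left( C(g_1, g_2, x) + \sigma g_1(x) g_2(x) \right) \right\} \le E_\sigma \left\{ \sup_{g_1, g_2} \left( C(g_1, \nu g_2, x) + M_2 \sigma g_1(x) + M_1 \sigma g_2(x) \right) \right\} .$$

Proof (of Lemma 4).

$$\begin{aligned}
E_\sigma & \left\{ \sup_{g_1, g_2} \left( C(g_1, g_2, x) + \sigma g_1(x) g_2(x) \right) \right\} \\
&= \frac{1}{2} \sup_{g_1, g_2} \left( C(g_1, g_2, x) + g_1(x) g_2(x) \right) + \frac{1}{2} \sup_{g_1, g_2} \left( C(g_1, g_2, x) - g_1(x) g_2(x) \right) \\
&= \frac{1}{2} \sup_{g_1, g_2, \tilde{g}_1, \tilde{g}_2} \left( C(g_1, g_2, x) + g_1(x) g_2(x) + C(\tilde{g}_1, \tilde{g}_2, x) - \tilde{g}_1(x) \tilde{g}_2(x) \right) \\
&\overset{(a)}{=} \frac{1}{2} \sup_{g_1, g_2, \tilde{g}_1, \tilde{g}_2} \left( C(g_1, g_2, x) + C(\tilde{g}_1, \tilde{g}_2, x) + |g_1(x) g_2(x) - \tilde{g}_1(x) \tilde{g}_2(x)| \right) \\
&\overset{(b)}{\le} \frac{1}{2} \sup_{g_1, g_2, \tilde{g}_1, \tilde{g}_2} \left( C(g_1, g_2, x) + C(\tilde{g}_1, \tilde{g}_2, x) + M_1 |g_2(x) - \tilde{g}_2(x)| + M_2 |g_1(x) - \tilde{g}_1(x)| \right) , \qquad (3)
\end{aligned}$$

where (a) is due to the symmetry of the expression over which the supremum
is taken and (b) is immediate, using the following inequality

$$\begin{aligned}
|g_1(x) g_2(x) - \tilde{g}_1(x) \tilde{g}_2(x)| &= |g_1(x)(g_2(x) - \tilde{g}_2(x)) + \tilde{g}_2(x)(g_1(x) - \tilde{g}_1(x))| \\
&\le M_1 |g_2(x) - \tilde{g}_2(x)| + M_2 |g_1(x) - \tilde{g}_1(x)| .
\end{aligned}$$

Next, we denote by $g_1^*, g_2^*, \tilde{g}_1^*, \tilde{g}_2^*$ the functions over which the supremum in (3)
is achieved and address all cases of the signum of the terms inside the absolute
values at (3).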
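Case 1: $g_2^*(x) > \tilde{g}_2^*(x)$, $g_1^*(x) > \tilde{g}_1^*(x)$.

$$\begin{aligned}
& \sup_{g_1, g_2, \tilde{g}_1, \tilde{g}_2} \left\{ C(g_1, g_2, x) + C(\tilde{g}_1, \tilde{g}_2, x) + M_1 (g_2(x) - \tilde{g}_2(x)) + M_2 (g_1(x) - \tilde{g}_1(x)) \right\} \\
&= \sup_{g_1, g_2} \left\{ C(g_1, g_2, x) + M_1 g_2(x) + M_2 g_1(x) \right\} + \sup_{\tilde{g}_1, \tilde{g}_2} \left\{ C(\tilde{g}_1, \tilde{g}_2, x) - M_1 \tilde{g}_2(x) - M_2 \tilde{g}_1(x) \right\} \\
&= 2 E_\sigma \sup_{g_1, g_2} \left\{ C(g_1, g_2, x) + M_1 \sigma g_2(x) + M_2 \sigma g_1(x) \right\} .
\end{aligned}$$

Case 2: $g_2^*(x) > \tilde{g}_2^*(x)$, $g_1^*(x) < \tilde{g}_1^*(x)$.

$$\begin{aligned}
& \sup_{g_1, g_2, \tilde{g}_1, \tilde{g}_2} \left\{ C(g_1, g_2, x) + C(\tilde{g}_1, \tilde{g}_2, x) + M_1 (g_2(x) - \tilde{g}_2(x)) + M_2 (\tilde{g}_1(x) - g_1(x)) \right\} \\
&\overset{(a)}{=} \sup_{g_1, g_2, \tilde{g}_1, \tilde{g}_2} \left\{ C(g_1, -g_2, x) + C(\tilde{g}_1, -\tilde{g}_2, x) + M_1 (\tilde{g}_2(x) - g_2(x)) + M_2 (\tilde{g}_1(x) - g_1(x)) \right\} \\
&= 2 E_\sigma \sup_{g_1, g_2} \left\{ C(g_1, -g_2, x) + M_1 \sigma g_2(x) + M_2 \sigma g_1(x) \right\} ,
\end{aligned}$$

where (a) is due to the assumption that $\mathcal{G}_2$ is closed under negation. Notice that
the cases where $g_2^*(x) < \tilde{g}_2^*(x)$, $g_1^*(x) < \tilde{g}_1^*(x)$ and $g_2^*(x) < \tilde{g}_2^*(x)$, $g_1^*(x) > \tilde{g}_1^*(x)$
are analogous to cases 1 and 2 respectively. □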

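We can now provide the proof of Lemma 2. By using Lemma 4 recursively
with a suitable definition of $C(g_1, g_2, x)$ in each iteration, we have for every
$t = 1, \ldots, N+1$

$$E_\sigma \left\{ \sup_{g_1, g_2} \sum_{n=1}^{N} \sigma_n g_1(x_n) g_2(x_n) \right\} \le E_\sigma \left\{ \sup_{g_1, g_2} \left( \sum_{n=t}^{N} \sigma_n g_1(x_n) g_2(x_n) + M_2 \sum_{n=1}^{t-1} \sigma_n g_1(x_n) + M_1 \sum_{n=1}^{t-1} r(n, t) \sigma_n g_2(x_n) \right) \right\} ,$$

where

$$r(n, t) = \begin{cases} \prod_{i=n}^{t-2} \nu_i & \text{if } n \le t - 2 \\ 1 & \text{if } n = t - 1 \\ \text{not defined} & \text{otherwise.} \end{cases}$$

By setting $t = N + 1$ we get

$$E_\sigma \left\{ \sup_{g_1, g_2} \sum_{n=1}^{N} \sigma_n g_1(x_n) g_2(x_n) \right\} \le M_2 E_\sigma \left\{ \sup_{g_1} \sum_{n=1}^{N} \sigma_n g_1(x_n) \right\} + M_1 E_\sigma \left\{ \sup_{g_2} \sum_{n=1}^{N} r(n, N+1) \sigma_n g_2(x_n) \right\} .$$

Recall that $\nu_t \in \{\pm 1\}$ for all $t$ and thus $r(n, N+1) \in \{\pm 1\}$ for all $n$. So, by redefining
$\sigma_n = r(n, N+1) \sigma_n$ for all $n$ in the second term of the last inequality, we complete the
proof of Lemma 2. □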
Abstract. Motivated by sensor networks and other distributed settings,
several models for distributed learning are presented. The models differ 
from classical works in statistical pattern recognition by allocating ob- 
servations of an i.i.d. sampling process amongst members of a network 
of learning agents. The agents are limited in their ability to communicate
to a fusion center; the amount of information available for classification
or regression is constrained. For several simple communication
models, questions of universal consistency are addressed; i.e., the asymp- 
totics of several agent decision rules and fusion rules are considered in 
both binary classification and regression frameworks. These models re- 
semble distributed environments and introduce new questions regarding 
universal consistency. Insofar as these models offer a useful picture of 
distributed scenarios, this paper considers whether the guarantees pro- 
vided by Stone’s Theorem in centralized environments hold in distributed 
settings. 



1 Introduction 

1.1 Models for Distributed Learning 

Consider the following learning model: Suppose $X$ and $Y$ are $\mathbb{R}^d$-valued and
$\mathcal{Y}$-valued random variables, respectively, with joint and marginal distributions
denoted by $P_{XY}$, $P_X$, and $P_Y$. Suppose $\mathcal{Y} \subseteq \mathbb{R}$ but is otherwise unspecified
for now; we will consider cases where $\mathcal{Y} = \{0, 1\}$ and $\mathcal{Y} = \mathbb{R}$. Suppose further
that $D_n = \{(X_i, Y_i)\}_{i=1}^{n}$ is an independent and identically distributed (i.i.d.)
collection of training data with $(X_i, Y_i) \sim P_{XY}$ for all $i \in \{1, \ldots, n\}$.

If $D_n$ is provided to a single learning agent, then we have a traditional centralized
setting and we can pose questions about the existence of classifiers or
estimators that are universally consistent. The answers to such questions are 
well understood and are provided by results such as Stone’s Theorem [1], [2], [3] 
and numerous others in the literature. 

* This research was supported in part by the Army Research Office under grant 
DAAD19-00-1-0466, in part by Draper Laboratory under grant IR&D 6002, in part 
by the National Science Foundation under grant CCR-0312413, and in part by the 
Office of Naval Research under Grant No. N00014-03-1-0102. 



J. Shawe-Taylor and Y. Singer (Eds.): COLT 2004, LNAI 3120, pp. 442—456, 2004. 
Instead, suppose that for each $i \in \{1, \ldots, n\}$, the training datum $(X_i, Y_i)$
is received by a distinct member of a network of $n$ simple learning agents. At
classification time, the central authority observes a new random feature vector $X$
distributed according to $P_X$ and communicates it to the network in a request for
information. At this time, each agent can respond with at most one bit. That is, 
each learning agent chooses whether or not to respond to the central authority’s 
request for information; if it chooses to respond, an agent sends either a 1 or 
a 0 based on its local decision algorithm. Upon observing the response of the 
network, the central authority fuses the information to create an estimate of Y . 

When $\mathcal{Y} = \{0, 1\}$, we have a binary classification framework and it is natural
to consider the probability of misclassification as the performance metric for the
network of agents. Similarly, when $\mathcal{Y} = \mathbb{R}$, we have a natural regression
framework and as is typical, we can consider the expected $L^2$-risk of the ensemble.
A key question that arises is: given such a model, do there exist agent decision 
rules and a central authority fusion rule that result in a universally consistent 
ensemble in the limit as the number of agents increases without bound? 

In what follows, we answer this question in the affirmative for both classifi- 
cation and regression. In the binary classification setting, we demonstrate agent 
decision rules and a central authority fusion rule that correspond nicely with 
classical kernel classifiers; the universal Bayes-risk consistency of this ensemble 
then follows immediately from celebrated analyses like Stone’s Theorem, etc. In 
the regression setting, we demonstrate that under regularity, randomized agent 
decision rules exist such that when the central authority applies a scaled ma- 
jority vote fusion of the agents’ decisions, the resulting estimator is universally 
consistent for $L^2$-risk.
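
To make the classification protocol concrete, the sketch below assumes a naive-kernel style rule: an agent responds with its own label when the query falls within a fixed radius of its training point and abstains otherwise, and the central authority takes a majority vote over the responses. This particular rule, the radius parameter, and the tie-breaking convention are illustrative assumptions and are not taken from the text.

```python
import numpy as np

def agent_response(x_i, y_i, x_query, radius):
    """Agent holding (x_i, y_i): send its label if x_query is within `radius`,
    otherwise abstain (represented by None)."""
    if np.linalg.norm(x_i - x_query) <= radius:
        return int(y_i)
    return None

def fuse_majority(responses):
    """Central authority: majority vote over the bits actually received.
    Ties and the all-abstain case are resolved as 0 (an arbitrary convention)."""
    votes = [r for r in responses if r is not None]
    if not votes:
        return 0
    return int(sum(votes) > len(votes) / 2)

def classify(X_train, y_train, x_query, radius):
    """Each agent sees one training pair; the fusion center sees only the bits."""
    responses = [agent_response(x, y, x_query, radius)
                 for x, y in zip(X_train, y_train)]
    return fuse_majority(responses)
```

Each agent uses only its own datum and the broadcast query, and the fusion rule uses only the received bits, which is the communication constraint of the model.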

In this model, each agent’s decision rule can be viewed as a selection of one 
of three states: abstain, vote and send 1, and vote and send 0. The option to 
abstain essentially allows the agents to convey slightly more information than 
the one bit that is assumed to be physically transmitted to the central authority. 
With this observation, these results can be interpreted as follows: log2(3) bits 
per agent per classification is sufficient for universal consistency to hold for both 
distributed classification and regression with abstention. 

In this view, it is natural to ask whether these log2(3) bits are necessary. 
Can consistency results be proven at lower bit rates? Consider a revised model, 
precisely the same as above, except that in response to the central authority’s 
request for information, each agent must respond with 1 or 0; abstention is not 
an option and thus, each agent responds with exactly one bit per classification. 
The same questions arise: are there rules for which universal consistency holds 
in distributed classification and regression without abstention?

Interestingly, we demonstrate that in the binary classification setting, ran- 
domized agent rules exist such that when a majority vote fusion rule is applied, 
universal Bayes-risk consistency holds. Moreover, it is clear that one bit is neces- 
sary. As an important negative result, we demonstrate that universal consistency 
in the $L^2$-risk regression framework is not possible in the one bit regime, under
reasonable assumptions on the candidate decision rules. 
1.2 Motivation and Background

Motivation for this problem lies in sensor networks [4]. Here, an array of sensors 
is distributed across a geographical terrain; using simple sensing functionality, 
the devices observe the environment and locally process the information for 
use by a central monitor. Given locally understood statistical models for the 
observations and the channel that the sensors use to communicate, sensors can be 
preprogrammed to process information optimally with respect to these models. 
Without such priors, can one devise distributed sensors that learn? Undoubtedly, 
the complexity of communication in this environment will complicate matters; 
how should the sensors share their data to maximize the inferential power of the 
network? 

Similar problems exist in distributed databases. Here, there is a database of 
training data that is massive in both the dimension of the feature space and 
quantity of data. However, for political, economic or technological reasons, this 
database is distributed geographically or in such a way that it is infeasible for 
any single agent to access the entire database. Multiple agents can be deployed 
to make inferences from various segments of the database. How should the agents 
communicate in order to maximize the performance of the ensemble? 

The spirit of the models presented in this paper is in line with models con- 
sidered in nonparametric statistics and the study of kernel methods and other 
Stone-type rules. Extensive work has been done related to the consistency of 
Stone-type rules under various sampling processes; see [2], [3] and references 
therein, [5], [6], [7], [8], [9], [10], [11], [12], [13], [1], [14]. These models focus on 
various dependency structures within the training data and assume that a single 
processor has access to the entire data stream. However, in distributed scenar- 
ios, many agents have access to different data streams that differ in distribution 
and may depend on external parameters such as the state of a sensor network 
or location of a database. Moreover, agents are unable to share their data with 
each other or with a central authority; they may have only a few bits with which 
to communicate a summary. 

The models presented in this paper differ from the works just cited by allo- 
cating observations of an i.i.d. sampling process to individual learning agents. 
By limiting the ability of the agents to communicate, we constrain the amount 
of information available to the ensemble and the central authority for use in 
classification or regression. These models more closely resemble a distributed 
environment and present new questions to consider with regard to universal 
consistency. Insofar as these models offer a useful picture of distributed scenar- 
ios, this paper considers whether the guarantees provided by Stone’s Theorem 
in centralized environments hold in distributed settings. 

Numerous other works in the literature are relevant to the research presented 
here. However, different points need to be made depending on whether we con- 
sider regression or classification with or without abstention. Without context, 
we will save such discussion for the appropriate sections in the paper. 

The remainder of this paper is organized as follows. In Section II, the relevant
notation and technical assumptions are introduced. In Section III, owing
to an immediate connection to Stone’s Theorem, we briefly present the result for
distributed classification with abstention. In Section IV, we present the results
for regression with abstention. In Sections V and VI, we discuss the results for
the model without abstention in the binary classification and regression frame- 
works, respectively. In each section, we present the main results, discuss impor- 
tant connections to other work in nonparametrics, and then proceed to describe 
the basic structure of the associated proof. Technical lemmas that are readily 
apparent from the literature are left to the appendix in Section VII. 



2 Preliminaries 

As stated earlier, suppose $X$ and $Y$ are $\mathbb{R}^d$-valued and $\mathcal{Y}$-valued random
variables, respectively, with joint and marginal distributions denoted by $P_{XY}$, $P_X$,
and $P_Y$. Assume $\mathcal{Y} \subseteq \mathbb{R}$. Suppose further that $D_n = \{(X_i, Y_i)\}_{i=1}^n$ is an
independent and identically distributed (i.i.d.) collection of training data with
$(X_i, Y_i) \sim P_{XY}$ for all $i \in \{1, \ldots, n\}$.

When $\mathcal{Y} = \{0, 1\}$, $P_{XY}$ specifies a binary classification problem. Let
$\delta_B(x) : \mathbb{R}^d \to \mathcal{Y}$ denote the Bayes decision rule for this problem and use $R^*$ to
denote the minimum Bayes risk,

$$R^* = P\{\delta_B(X) \ne Y\}. \tag{1}$$

When $\mathcal{Y} = \mathbb{R}$, $P_{XY}$ specifies a regression problem and, as is well known, the
regression function

$$\eta(x) = E\{Y \mid X = x\} \tag{2}$$

minimizes $E\{|f(X) - Y|^2\}$ over all measurable functions $f$.

Throughout this paper, we will use $\delta_{ni}(x)$ to denote the learning agent’s
decision rule in an ensemble of $n$ agents. For each $i \in \{1, \ldots, n\}$,
$\delta_{ni}(x) = \delta_{ni}(x, X_i, Y_i) : \mathbb{R}^d \times \mathbb{R}^d \times \mathcal{Y} \to \mathcal{S}$
is a function of the observation $X$ made
by the central authority and $(X_i, Y_i)$, the training data observed by the agent
itself. Here $\mathcal{S}$ is the decision space for the agent; in models with abstention we
take $\mathcal{S} = \{\text{abstain}, \text{send 1}, \text{send 0}\}$ and in models without abstention we take
$\mathcal{S} = \{\text{send 1}, \text{send 0}\}$. In various parts of this paper, agent decision rules will be
randomized; in these cases $\delta_{ni}(x) = \delta_{ni}(x, X_i, Y_i, Z_{ni})$ is dependent on an
additional random variable $Z_{ni}$. Consistent with this notation, we assume that the
agents have knowledge of $n$, the number of agents in the ensemble. Moreover,
we assume that for each $n$, every agent has the same local decision rule; i.e., the
ensemble is homogeneous in this sense.

We use $g_n(x) = g_n(x, \{\delta_{ni}(x)\}_{i=1}^n) : \mathbb{R}^d \times \{0, 1\}^n \to \{0, 1\}$ to denote the
fusion rule in the binary classification frameworks and similarly, we use
$\hat\eta_n(x) = \hat\eta_n(x, \{\delta_{ni}(x)\}_{i=1}^n) : \mathbb{R}^d \times \{0, 1\}^n \to \mathbb{R}$ to denote the fusion rule in the regression
frameworks.
3 Distributed Classification with Abstention: Stone’s Theorem



In this section, we show that the universal consistency of distributed classification 
with abstention follows immediately from Stone’s Theorem and the classical 
analysis of naive kernel classifiers. To start, let us briefly recap the model. Since 
we are in the classification framework, $\mathcal{Y} = \{0, 1\}$. Suppose that for each
$i \in \{1, \ldots, n\}$, the training datum $(X_i, Y_i) \in D_n$ is received by a distinct member
of a network of n learning agents. At classification time, the central authority 
observes a new random feature vector X and communicates this to the network 
of learning agents in a request for information. At this time, each of the learning 
agents can respond with at most one bit. That is, each learning agent chooses 
whether or not to respond to the central authority’s request for information; and 
if an agent chooses to respond, it sends either a 1 or a 0 based on a local decision 
algorithm. Upon receiving the agents’ responses, the central authority fuses the 
information to create an estimate of Y . 

To answer the question of whether agent decision rules and central authority 
fusion rules exist that result in a universally consistent ensemble, let us construct 
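one natural choice. With $B_{r_n}(x) = \{y \in \mathbb{R}^d : \|x - y\|_2 \le r_n\}$, let

$$\delta_{ni}(x) = \begin{cases} Y_i, & \text{if } X_i \in B_{r_n}(x) \\ \text{abstain}, & \text{otherwise} \end{cases} \tag{3}$$

and

$$g_n(x) = \begin{cases} 1, & \text{if } \dfrac{\sum_{i=1}^n \delta_{ni}(x)\, 1_{\{\delta_{ni}(x) \ne \text{abstain}\}}}{\sum_{i=1}^n 1_{\{\delta_{ni}(x) \ne \text{abstain}\}}} \ge \dfrac{1}{2} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$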



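so that $g_n(x)$ amounts to a majority vote fusion rule. With this choice, it is
straightforward to see that the net decision rule is equivalent to the plug-in
kernel classifier rule with the naive kernel. Indeed,

$$g_n(x) = \begin{cases} 1, & \text{if } \dfrac{\sum_{i=1}^n Y_i\, 1_{\{X_i \in B_{r_n}(x)\}}}{\sum_{i=1}^n 1_{\{X_i \in B_{r_n}(x)\}}} \ge \dfrac{1}{2} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$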

With this equivalence, the universal consistency of the ensemble follows from
Stone’s Theorem applied to naive kernel classifiers. With $R_n = P\{g_n(X) \ne Y \mid D_n\}$,
the probability of error of the ensemble conditioned on the random
training data, we state this known result without proof as Theorem 1.

Theorem 1. ([2]) If, as $n \to \infty$, $r_n \to 0$ and $r_n^d\, n \to \infty$, then $E\{R_n\} \to R^*$ for
all distributions $P_{XY}$.
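The equivalence between the abstaining ensemble and the naive kernel classifier can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names, the `ABSTAIN` sentinel, the closed ball, and the default output of 0 when every agent abstains are all assumptions made for illustration.

```python
import numpy as np

ABSTAIN = -1  # hypothetical sentinel for "no response"; any value outside {0, 1} works


def agent_rule(x, X_i, Y_i, r_n):
    """Agent rule (3): send Y_i if X_i lies within r_n of x, otherwise abstain."""
    return int(Y_i) if np.sum((x - X_i) ** 2) <= r_n ** 2 else ABSTAIN


def fuse(responses):
    """Fusion rule (4): majority vote over the agents that responded."""
    votes = [b for b in responses if b != ABSTAIN]
    if not votes:
        return 0  # assumption: output 0 when nobody responds
    return 1 if sum(votes) / len(votes) >= 0.5 else 0


def ensemble_classify(x, X, Y, r_n):
    """Each agent sees only its own (X[i], Y[i]); the center sees only the responses."""
    return fuse([agent_rule(x, X[i], Y[i], r_n) for i in range(len(X))])


def naive_kernel_classify(x, X, Y, r_n):
    """Centralized plug-in rule (5) with the naive kernel, for comparison."""
    inside = np.sum((X - x) ** 2, axis=1) <= r_n ** 2
    if not inside.any():
        return 0
    return 1 if Y[inside].mean() >= 0.5 else 0


rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
Y = (X[:, 0] > 0).astype(int)
queries = rng.uniform(-1.0, 1.0, size=(50, 2))
agree = all(
    ensemble_classify(x, X, Y, 0.3) == naive_kernel_classify(x, X, Y, 0.3)
    for x in queries
)
print(agree)
```

The comparison only confirms that the two rules coincide on a finite sample; consistency itself is the content of Theorem 1.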



4 Distributed Regression with Abstention 

A more interesting model to consider is in the context of regression, estimating 
a real-valued concept in a bandwidth-starved environment. As above, the model
remains the same except that $\mathcal{Y} = \mathbb{R}$; that is, $Y$ is an $\mathbb{R}$-valued random variable
and likewise, agents receive real-valued training data labels, $Y_i$.

With the aim of determining whether universally consistent ensembles can 
be constructed, let us devise candidate rules. These rules will be randomized; 
however they will adhere to the communication constraints of the model. 

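For each integer $n$, let $\{Z_{n,\theta}\}_{\theta \in [0,1]}$ be a family of $\{0, 1\}$-valued
random variables parameterized by $[0, 1]$ such that for each $\theta \in [0, 1]$, $Z_{n,\theta}$ is
Bernoulli with parameter $\theta$.

Let $\{c_n\}_{n=1}^\infty$ and $\{r_n\}_{n=1}^\infty$ be arbitrary sequences of real numbers such that
$c_n \to \infty$ and $r_n \to 0$ as $n \to \infty$. Let $\delta_{ni}(x)$ be defined as:

$$\delta_{ni}(x) = \begin{cases} Z_{n,\, \frac{1}{2}\frac{Y_i}{c_n} + \frac{1}{2}}, & \text{if } x \in B_{r_n}(X_i) \text{ and } |Y_i| \le c_n \\ Z_{n,\, \frac{1}{2}}, & \text{if } x \in B_{r_n}(X_i) \text{ and } |Y_i| > c_n \\ \text{abstain}, & \text{otherwise} \end{cases} \tag{6}$$

for $i = 1, \ldots, n$. In words, the agents choose to vote if $X_i$ is close enough to $X$;
to vote, they flip a biased coin, with the bias determined by $Y_i$ and the size of
the ensemble, $n$.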

Let us define the central authority fusion rule: 



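$$\hat\eta_n(x) = 2c_n \left( \frac{\sum_{i=1}^n \delta_{ni}(x)\, 1_{\{\delta_{ni}(x) \ne \text{abstain}\}}}{\sum_{i=1}^n 1_{\{\delta_{ni}(x) \ne \text{abstain}\}}} - \frac{1}{2} \right) \tag{7}$$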

In words, the central authority shifts and scales a majority vote. 
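The agent rule (6) and fusion rule (7) can be sketched in a few lines. This is an illustrative simulation under stated assumptions, not the authors' code: the regression function $\sin(2\pi x)$, the noise level, the fixed values of `r_n` and `c_n`, and the fallback of 0.0 when every agent abstains are all hypothetical choices.

```python
import numpy as np


def agent_response(x, X_i, Y_i, r_n, c_n, rng):
    """Agent rule (6): return one bit, or None to abstain."""
    if np.sum((x - X_i) ** 2) > r_n ** 2:
        return None                      # X_i is not within r_n of x: abstain
    if abs(Y_i) <= c_n:
        theta = 0.5 * Y_i / c_n + 0.5    # coin bias encodes the label, scaled into [0, 1]
    else:
        theta = 0.5                      # label exceeds c_n: flip a fair coin
    return int(rng.random() < theta)


def fuse(responses, c_n):
    """Fusion rule (7): shift and scale the fraction of ones among the voters."""
    votes = [b for b in responses if b is not None]
    if not votes:
        return 0.0                       # assumed fallback when every agent abstains
    return 2.0 * c_n * (sum(votes) / len(votes) - 0.5)


rng = np.random.default_rng(0)
n, r_n, c_n = 20_000, 0.05, 3.0          # hypothetical settings for illustration
X = rng.uniform(0.0, 1.0, size=(n, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)

x = np.array([0.25])
responses = [agent_response(x, X[i], Y[i], r_n, c_n, rng) for i in range(n)]
estimate = fuse(responses, c_n)
print(estimate)                          # should land near eta(0.25) = 1
```

Each voting agent's bit has mean $\frac{1}{2}\frac{Y_i}{c_n} + \frac{1}{2}$, so the shift by $\frac{1}{2}$ and scaling by $2c_n$ in the fusion rule undo the encoding on average.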

In this regression setting, it is natural to consider the $L^2$-risk of the ensemble.
Here, we will consider $E\{|\hat\eta_n(X) - \eta(X)|^2\}$ with the expectation taken over $X$,
$D_n = \{(X_i, Y_i)\}_{i=1}^n$, and the randomness introduced in the agent decision rules.



4.1 Main Result and Comments 

Assuming an ensemble using the described decision rules, Proposition 1 specifies
sufficient conditions for consistency.

Proposition 1. Suppose $P_{XY}$ is such that $P_X$ is compactly supported and
$E\{Y^2\} < \infty$. If, as $n \to \infty$,

1. $c_n \to \infty$,

2. $r_n \to 0$, and

3. $\frac{c_n^2}{n r_n^d} \to 0$,

then $E\{|\hat\eta_n(X) - \eta(X)|^2\} \to 0$.

More generally, the constraint regarding the compactness of $P_X$ can be weakened.
As will be observed in the proof below, $P_X$ must be such that when coupled
with a bounded random variable $Y$, there is a known convergence rate of the
variance term of the naive kernel classifier (under a standard i.i.d. sampling
model). $\{c_n\}_{n=1}^\infty$ should be chosen so that it grows at a rate slower than the rate
at which the variance term decays. Notably, to select $\{c_n\}_{n=1}^\infty$, one does not need
to understand the convergence rate of the bias term, and this is why continuity
conditions are not required; the bias term will converge to zero universally as
long as $c_n \to \infty$ and $r_n \to 0$ as $n \to \infty$.

Note that the divergent scaling sequence $\{c_n\}_{n=1}^\infty$ is required for the general
case when there is no reason to assume that $Y$ has a known bound. If, instead,
$|Y| \le B$ a.s. for some known $B > 0$, it suffices to let $c_n = B$ for all $n$.

Given our choice of agent decision rules, it is natural to ask whether the 
current model can be posed as a special case of regression with noisy labels. If 
so, the noise would map the label Yi to the set {0, 1} in a manner that would 
be statistically dependent on X, Xi, Yi itself and n. Though it is possible to 
view the current question in this framework, to our knowledge such a highly 
structured noise model has not been considered in the literature. 

Finally, those familiar with the classical statistical pattern recognition lit- 
erature will find the style of proof very familiar; special care must be taken to 
demonstrate that the variance of the estimate does not decrease too slowly compared
to $\{c_n\}_{n=1}^\infty$ and to show that the bias introduced by the “clipped” agent
decision rules converges to zero.



4.2 Proof of Proposition 1 

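For ease of exposition, let us define a collection of independent auxiliary random
variables, $\{Z_n\}_{n=1}^\infty$, such that $X \to Y \to Z_n$ forms a Markov chain and satisfies

$$P_{Z_n \mid Y} = \begin{cases} P_{Z_{n,\, \frac{1}{2}\frac{Y}{c_n} + \frac{1}{2}}}, & |Y| \le c_n \\ P_{Z_{n,\, \frac{1}{2}}}, & |Y| > c_n \end{cases}$$

for all $n$. $P_{Z_{n,\theta}}$ is defined in the section above.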



Proof. In the interest of space, we will not repeat the parts of the proof common 
to the analysis of other Stone-type rules; instead we highlight only the parts of 
the proof where differences arise. 

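Let $\eta_n(x) = E\{Z_n \mid X = x\}$. Proceeding in the traditional manner, note that
by the inequality $(a + b)^2 \le 2a^2 + 2b^2$,

$$\begin{aligned}
E\{|\hat\eta_n(X) - \eta(X)|^2\}
&\le 2E\left\{ \left| 2c_n\left( \frac{\sum_{i=1}^n \delta_{ni}(X)\, 1_{\{X_i \in B_{r_n}(X)\}}}{\sum_{i=1}^n 1_{\{X_i \in B_{r_n}(X)\}}} - \frac{1}{2} \right) - 2c_n\left( \frac{\sum_{i=1}^n \eta_n(X_i)\, 1_{\{X_i \in B_{r_n}(X)\}}}{\sum_{i=1}^n 1_{\{X_i \in B_{r_n}(X)\}}} - \frac{1}{2} \right) \right|^2 \right\} \\
&\quad + 2E\left\{ \left| 2c_n\left( \frac{\sum_{i=1}^n \eta_n(X_i)\, 1_{\{X_i \in B_{r_n}(X)\}}}{\sum_{i=1}^n 1_{\{X_i \in B_{r_n}(X)\}}} - \frac{1}{2} \right) - \eta(X) \right|^2 \right\} \\
&\triangleq 2(I_n + J_n).
\end{aligned}$$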

Note that $I_n$ is essentially the variance of the estimator. Using arguments typical
in the study of Stone-type rules ([2]), it is straightforward to show that

$$I_n \le 4 c_n^2\, E\left\{ \frac{1}{n P_X(B_{r_n}(X))} \right\}. \tag{8}$$

Since $P_X$ is compactly supported, the expectation in (8) can be bounded by a
term $O\left(\frac{1}{n r_n^d}\right)$ using an argument typically used to demonstrate the consistency
of kernel estimators [3]. This fact implies that

$$I_n = O\left(\frac{c_n^2}{n r_n^d}\right),$$

and thus, by condition (3) of Proposition 1, $I_n \to 0$. Taking care to ensure that
the multiplicative constant $c_n$ does not cause the variance term to explode, this
argument is essentially the same as showing that in traditional i.i.d. sampling
process settings, the variance of the naive kernel is universally bounded by a term
$O\left(\frac{1}{n r_n^d}\right)$ when $P_X$ is compactly supported and $Y$ is bounded [3]. This observation
is consistent with the comments above.

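Now, let us consider $J_n$. Fix $\varepsilon > 0$. We will show that for all sufficiently
large $n$, $J_n < \varepsilon$. Let $\eta_\varepsilon(x)$ be a bounded continuous function with bounded
support such that $E\{|\eta_\varepsilon(X) - \eta(X)|^2\}$ is smaller than a suitable fixed fraction
of $\varepsilon$. Since $E\{Y^2\} < \infty$ implies that
$\eta(x) \in L^2(P_X)$, such a function is assured to exist due to the density of such
functions in $L^2(P_X)$. By the inequality $(a + b + c + d)^2 \le 4a^2 + 4b^2 + 4c^2 + 4d^2$,

$$\begin{aligned}
J_n &\le 4E\left\{ \left| 2c_n\left( \frac{\sum_{i=1}^n \eta_n(X_i)\, 1_{\{X_i \in B_{r_n}(X)\}}}{\sum_{i=1}^n 1_{\{X_i \in B_{r_n}(X)\}}} - \frac{1}{2} \right) - \frac{\sum_{i=1}^n \eta_\varepsilon(X_i)\, 1_{\{X_i \in B_{r_n}(X)\}}}{\sum_{i=1}^n 1_{\{X_i \in B_{r_n}(X)\}}} \right|^2 \right\} \\
&\quad + 4E\left\{ \left| \frac{\sum_{i=1}^n \eta_\varepsilon(X_i)\, 1_{\{X_i \in B_{r_n}(X)\}}}{\sum_{i=1}^n 1_{\{X_i \in B_{r_n}(X)\}}} - \frac{\sum_{i=1}^n \eta_\varepsilon(X)\, 1_{\{X_i \in B_{r_n}(X)\}}}{\sum_{i=1}^n 1_{\{X_i \in B_{r_n}(X)\}}} \right|^2 \right\} \\
&\quad + 4E\left\{ \left| \frac{\sum_{i=1}^n \eta_\varepsilon(X)\, 1_{\{X_i \in B_{r_n}(X)\}}}{\sum_{i=1}^n 1_{\{X_i \in B_{r_n}(X)\}}} - \eta_\varepsilon(X) \right|^2 \right\} \\
&\quad + 4E\{|\eta_\varepsilon(X) - \eta(X)|^2\} \\
&= 4(J_{n1} + J_{n2} + J_{n3} + J_{n4}).
\end{aligned}$$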

One can show that for some constant c, 

< 2ce{|2c„«„(V) - ' 

Essentially, this follows by applying several algebraic bounds and technical
Lemma 4. Continuing with the familiar inequality $(a + b)^2 \le 2a^2 + 2b^2$,

J„i <2eE{|2c„(274X)-l)-27(X)|'}+E{|r,,(X)-77(X)|^} + E{ ^p^^^" J . 

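Note that $\eta_n(x) = E\{Z_n \mid X = x\} = E\left\{ \left( \frac{1}{2c_n} Y + \frac{1}{2} \right) 1_{\{|Y| \le c_n\}} + \frac{1}{2}\, 1_{\{|Y| > c_n\}} \,\Big|\, X = x \right\}$.
Substituting this above and applying Jensen’s inequality, we have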

a„. < 2cE(y»i,|,„..,) + ± } ■ (9) 
By the monotone convergence theorem, the first term in (9) converges to zero.
The second term in (9) converges to zero by the same argument applied for $I_n$.
Thus, $\lim_{n \to \infty} J_{n1}$ is at most a fixed fraction of $\varepsilon$.

Using the uniform continuity of $\eta_\varepsilon$ in combination with the fact that $r_n \to 0$,
it is straightforward to show that $J_{n2}$ is likewise at most a fixed fraction of $\varepsilon$
for all sufficiently large $n$.

Using the boundedness of $\eta_\varepsilon$, it is straightforward to show that

$$J_{n3} \le \sup_x \left(\eta_\varepsilon(x)^2\right) P\left\{ \sum_{i=1}^n 1_{\{X_i \in B_{r_n}(X)\}} = 0 \right\},$$

and thus, $J_{n3} \to 0$ by the same argument applied to $I_n$. Finally, $J_{n4}$ is at most
a fixed fraction of $\varepsilon$ by our choice of $\eta_\varepsilon(x)$. Combining each of these observations,
it follows that $\lim_{n \to \infty} J_n \le \varepsilon$. This completes the proof. □

5 Distributed Classification Without Abstention 

As noted in the introduction, given the results of the previous two sections, it is 
natural to ask whether the communication constraints can be tightened. Let us 
consider a second model in which the agents cannot choose to abstain. In effect, 
each agent communicates one bit per decision. First, let us consider the binary 
classification framework but as a technical convenience, adjust our notation so 
that $\mathcal{Y} = \{+1, -1\}$ instead of the usual $\{0, 1\}$; also, agents now decide between
sending ±1. We again consider whether universally Bayes-risk consistent schemes 
exist for the ensemble. 

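Let $\{Z_{ni}\}_{i=1}^n$ be a family of $\{+1, -1\}$-valued random variables such that
$P\{Z_{ni} = +1\} = \frac{1}{2}$.

Consider the randomized agent decision rule specified as follows:

$$\delta_{ni}(x) = \begin{cases} Y_i, & \text{if } X_i \in B_{r_n}(x) \\ Z_{ni}, & \text{otherwise} \end{cases} \tag{10}$$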


That is, the agents respond according to their training data if x is sufficiently 
close to Xi. Else, they simply “guess”, flipping an unbiased coin. It is readily 
verified that each agent transmits one bit per decision. 

A natural fusion rule for the central authority is the majority vote. That is, 
the central authority decides according to 



$$g_n(x) = \begin{cases} 1, & \text{if } \sum_{i=1}^n \delta_{ni}(x) > 0 \\ -1, & \text{otherwise} \end{cases} \tag{11}$$

Of course, the natural performance metric for the ensemble is the probability of
misclassification. Modifying our convention slightly, let $D_n = \{(X_i, Y_i, Z_{ni})\}_{i=1}^n$.
Define

$$R_n = P\{g_n(X) \ne Y \mid D_n\}. \tag{12}$$

That is, Rn is the conditional probability of error of the majority vote fusion 
rule conditioned on the randomness in agent training and agent decision rules. 
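Rules (10) and (11) are simple enough to simulate directly. The sketch below is illustrative only: the one-dimensional noiseless data, the fixed `r_n`, and all function names are assumptions, and a single run demonstrates the mechanism rather than consistency.

```python
import numpy as np


def agent_bit(x, X_i, Y_i, r_n, rng):
    """Agent rule (10): echo the local label if X_i is within r_n of x, else guess."""
    if np.sum((x - X_i) ** 2) <= r_n ** 2:
        return int(Y_i)
    return 1 if rng.random() < 0.5 else -1   # unbiased coin flip in {+1, -1}


def majority_vote(bits):
    """Fusion rule (11): +1 if the bits sum to a positive number, -1 otherwise."""
    return 1 if sum(bits) > 0 else -1


def ensemble_classify(x, X, Y, r_n, rng):
    """Every agent always transmits exactly one bit."""
    return majority_vote([agent_bit(x, X[i], Y[i], r_n, rng) for i in range(len(X))])


rng = np.random.default_rng(1)
n, r_n = 40_000, 0.2                         # hypothetical settings for illustration
X = rng.uniform(-1.0, 1.0, size=(n, 1))
Y = np.where(X[:, 0] > 0, 1, -1)             # noiseless labels: sign of the feature

label_pos = ensemble_classify(np.array([0.5]), X, Y, r_n, rng)
label_neg = ensemble_classify(np.array([-0.5]), X, Y, r_n, rng)
print(label_pos, label_neg)
```

With these settings roughly a fifth of the agents are informed for each query, while the guessing agents contribute a sum whose standard deviation is on the order of $\sqrt{n}$, so the informed votes dominate.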
5.1 Main Result and Comments

Assuming an ensemble using the described decision rules, Proposition 2 specifies 
sufficient conditions for consistency. 



Proposition 2. If, as $n \to \infty$, $r_n \to 0$ and $r_n^d \sqrt{n} \to \infty$, then $E\{R_n\} \to R^*$.

Yet again, the conditions of the proposition strike a similarity with consistency
results for kernel classifiers using the naive kernel. Indeed, $r_n \to 0$ ensures
the bias of the classifier decays to zero. However, $\{r_n\}_{n=1}^\infty$ must not decay too
rapidly. As the number of agents in the ensemble grows large, many, indeed most,
of the agents will be “guessing” for any given classification; in general, only a
decaying fraction of the agents will respond with useful information. In order
to ensure that these informative bits can be heard through the noise introduced
by the guessing agents, $r_n^d \sqrt{n} \to \infty$. Note the difference between the result for
naive kernel classifiers where $r_n^d\, n \to \infty$ dictates a sufficient rate of convergence
for $\{r_n\}_{n=1}^\infty$.
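A rough heuristic for why $\sqrt{n}$ appears in place of $n$ can be written out with Chebyshev's inequality. This is a sketch for a fixed query point $x$, not the paper's argument; the shorthand $S_n(x)$ and $p_n(x)$ is introduced here for illustration, and the last step assumes (beyond what the proposition requires) that $P_X$ places mass of order $r_n^d$ on small balls around $x$.

```latex
% Shorthand (introduced here): S_n(x) = \sum_{i=1}^n \delta_{ni}(x),
% p_n(x) = P_X(B_{r_n}(x)),  \eta_n(x) = E\{\eta(X) \mid X \in B_{r_n}(x)\}.
\begin{align*}
E\{\delta_{ni}(x)\} &= \eta_n(x)\, p_n(x),
  & \operatorname{Var}\{\delta_{ni}(x)\} &= 1 - \big(\eta_n(x)\, p_n(x)\big)^2 \le 1, \\
E\{S_n(x)\} &= n\, \eta_n(x)\, p_n(x),
  & \operatorname{Var}\{S_n(x)\} &\le n .
\end{align*}
% The vote has the wrong sign only if S_n(x) deviates from its mean by at least |E\{S_n(x)\}|:
\begin{equation*}
P\big\{ \operatorname{sgn}(\eta_n(x))\, S_n(x) \le 0 \big\}
  \le P\big\{ |S_n(x) - E\{S_n(x)\}| \ge n\, |\eta_n(x)|\, p_n(x) \big\}
  \le \frac{n}{n^2\, \eta_n(x)^2\, p_n(x)^2}
  = \frac{1}{\eta_n(x)^2 \big( \sqrt{n}\, p_n(x) \big)^2} .
\end{equation*}
% If p_n(x) \ge c\, r_n^d for some c > 0 and \eta_n(x) stays away from zero,
% the bound vanishes as soon as r_n^d \sqrt{n} \to \infty.
```

The guessing agents add variance of order $n$ regardless of how few agents are informed, which is why the informed fraction must outpace $1/\sqrt{n}$ rather than $1/n$.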

Notably, to prove this result, we show directly that the expected probabil- 
ity of misclassification converges to the Bayes rate. This is unlike techniques 
commonly used to demonstrate the consistency of kernel classifiers, etc., which 
are so-called “plug-in” classification rules. These rules estimate the a posteriori 
probabilities $P\{Y = i \mid X\}$, $i = \pm 1$, and construct classifiers based on thresholding
the estimate. In this setting, it suffices to show that these estimates converge
to the true probabilities in $L^p(P_X)$. However, for this model, we cannot estimate
the a posteriori probabilities and must resort to another proof technique; this 
foreshadows the negative result of Section VI. 

With our choice of “coin flipping” agent decision rules, this model feels much
like that presented in “Learning with an Unreliable Teacher” [15]. Several
distinctions should be made. While [15] considers the asymptotic probability of
error of both the 1-NN rule and “plug-in” classification rules, in our model, the 
resulting classifier cannot be viewed as being 1-NN nor plug-in. Thus, the results 
are immediately different. Even so, the noise model considered here is much dif- 
ferent; unlike [15], the noise here is statistically dependent on X, the object to 
be classified, as well as dependent on n. 



5.2 Proof of Proposition 2 



Proof. Fix an arbitrary $\varepsilon > 0$. We will show that $E\{R_n\} - R^*$ is less than $\varepsilon$
for all sufficiently large $n$. Recall from (2) that
$\eta(x) = E\{Y \mid X = x\} = P\{Y = +1 \mid X = x\} - P\{Y = -1 \mid X = x\}$
and define $A_\varepsilon = \{x : |\eta(x)| > \frac{\varepsilon}{2}\}$. Though we
save the details for the sake of space, it follows from (1), (12), and a series of
simple expectation manipulations that

$$E\{R_n\} - R^* \le P\{g_n(X) \ne \delta_B(X) \mid X \in A_\varepsilon\}\, P\{A_\varepsilon\} + \frac{\varepsilon}{2}.$$







J.B. Predd, S.R. Kulkarni, and H.V. Poor 



If $P\{A_\varepsilon\} = 0$, then the proof is complete. Proceed assuming $P\{A_\varepsilon\} > 0$, and
define the quantities

$$m_n(x) = E\{\eta(X)\, \delta_{ni}(X) \mid X = x\}$$
$$\sigma_n^2(x) = E\{|\eta(X)\, \delta_{ni}(X) - m_n(X)|^2 \mid X = x\},$$

with the expectation being taken over the random training data and the randomness
introduced by the agent decision rules. Respectively, $m_n(x)$ and $\sigma_n^2(x)$ can
be interpreted as the mean and variance of the “margin” of the agent decision
rule $\delta_{ni}(X)$, conditioned on the observation $X$. For large positive $m_n(x)$, the
agents can be expected to respond “confidently” (with large margin) according
to the Bayes rule when asked to classify an object $x$. For large $\sigma_n^2(x)$, the central
authority can expect to observe a large variance amongst the individual agent
responses to $x$.

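Fix any integer $k > 0$. Consider the sequence of sets indexed by $n$,

$$B_{n,k} = \{x : m_n(x)\, n > k \sqrt{n}\, \sigma_n(x)\},$$

so that $x \in B_{n,k}$ if and only if $\frac{m_n(x) \sqrt{n}}{\sigma_n(x)} > k$. We can interpret $B_{n,k}$ as the set
of objects for which informed agents have a sufficiently strong signal compared
with the noise of the guessing agents. One can show that

$$P\{g_n(X) \ne \delta_B(X) \mid X \in A_\varepsilon\} \le P\Big\{ \eta(X) \sum_{i=1}^n \delta_{ni}(X) < 0 \,\Big|\, X \in A_\varepsilon \cap B_{n,k} \Big\} + P\{\bar{B}_{n,k} \mid X \in A_\varepsilon\}, \tag{13}$$

where $\bar{B}_{n,k}$ denotes the complement of $B_{n,k}$.

Note that conditioned on $X$, $\eta(X) \sum_{i=1}^n \delta_{ni}(X)$ is a sum of independent and
identically distributed random variables with mean $m_n(X)$ and variance $\sigma_n^2(X)$. Further,
for $x \in B_{n,k}$, $\eta(x) \sum_{i=1}^n \delta_{ni}(x) < 0$ implies
$\left| \eta(x) \sum_{i=1}^n \delta_{ni}(x) - m_n(x)\, n \right| > k \sqrt{n}\, \sigma_n(x)$.
Thus, using Markov’s inequality, one can show that

$$P\Big\{ \eta(X) \sum_{i=1}^n \delta_{ni}(X) < 0 \,\Big|\, X \in A_\varepsilon \cap B_{n,k} \Big\} \le \frac{1}{k^2}.$$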

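Thus, the first term in (13) can be made arbitrarily small. Now, let us determine
specific expressions for $m_n(x)$ and $\sigma_n^2(x)$, as dictated by our choice of agent
decision rules. Algebraic simplification yields

$$m_n(x) = \eta(x)\, \eta_n(x) \int 1_{B_{r_n}(x)}(y)\, P_X(dy)$$
$$\sigma_n^2(x) = \eta^2(x) \left( 1 - E\{\delta_{ni}(X) \mid X = x\}^2 \right),$$

with $\eta_n(x) = E\{\eta(X) \mid X \in B_{r_n}(x)\}$.

Substituting these expressions into the second term of (13), it follows that

$$P\{\bar{B}_{n,k} \mid X \in A_\varepsilon\} = P\left\{ \frac{\operatorname{sgn}(\eta(X))\, \eta_n(X)}{\sqrt{1 - E\{\delta_{ni}(X) \mid X\}^2}} \int 1_{B_{r_n}(X)}(y)\, P_X(dy)\, \sqrt{n} \le k \,\Big|\, X \in A_\varepsilon \right\}.$$

For any $1 > \gamma > 0$, we have

$$\begin{aligned}
P\{\bar{B}_{n,k} \mid X \in A_\varepsilon\}
&\le P\left\{ \frac{\gamma}{\sqrt{1 - E\{\delta_{ni}(X) \mid X\}^2}} \int 1_{B_{r_n}(X)}(y)\, P_X(dy)\, \sqrt{n} \le k \,\Big|\, X \in A_\varepsilon,\ \operatorname{sgn}(\eta(X))\, \eta_n(X) \ge \gamma \right\} \\
&\quad + P\{\operatorname{sgn}(\eta(X))\, \eta_n(X) < \gamma \mid X \in A_\varepsilon\}.
\end{aligned} \tag{14}$$

Set $\gamma = \frac{\varepsilon}{4}$. It follows from our choice of $A_\varepsilon$ that

$$P\left\{\operatorname{sgn}(\eta(X))\, \eta_n(X) < \tfrac{\varepsilon}{4} \,\Big|\, X \in A_\varepsilon\right\} \le P\left\{|\eta(X) - \eta_n(X)| > \tfrac{\varepsilon}{4} \,\Big|\, X \in A_\varepsilon\right\}.$$

Since by Lemma 2, $\eta_n(X) \to \eta(X)$ in probability and by assumption $P\{A_\varepsilon\} > 0$,
it follows from Lemma 1 that $P\{\operatorname{sgn}(\eta(X))\, \eta_n(X) < \gamma \mid X \in A_\varepsilon\} \to 0$.

Returning to the first term of (14), note that we have just demonstrated that
$\lim P\{\operatorname{sgn}(\eta(X))\, \eta_n(X) \ge \gamma \mid X \in A_\varepsilon\} = 1$. Thus, by Lemma 1, it suffices to show that

$$\frac{1}{\sqrt{1 - E\{\delta_{ni}(X) \mid X\}^2}} \int 1_{B_{r_n}(X)}(y)\, P_X(dy)\, \sqrt{n} \to \infty \quad \text{i.p.} \tag{15}$$

Since $\frac{1}{\sqrt{1 - E\{\delta_{ni}(X) \mid X\}^2}} \ge 1$, this follows from Lemma 3 and the fact that
$r_n^d \sqrt{n} \to \infty$. This completes the proof. □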



6 Distributed Regression Without Abstention 



Finally, let us consider the model presented in Section V in a regression framework.
Now, $\mathcal{Y} = \mathbb{R}$; agents will receive real-valued training data labels $Y_i$.
When asked to respond with information, they will reply with either 0 or 1. 
We will demonstrate that universal consistency is not achievable in this one bit 
regime. 

Let $\mathcal{A} = \{a : \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R} \to [0, 1]\}$. That is, $\mathcal{A}$ is the collection of
functions mapping $\mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}$ to $[0, 1]$. For every sequence of functions
$\{a_n\}_{n=1}^\infty \subset \mathcal{A}$, there is a corresponding sequence of randomized agent decision
rules $\{\delta_{ni}(x)\}_{n=1}^\infty$, specified by

$$\delta_{ni}(x) = Z_{i,\, a_n(x, X_i, Y_i)}, \tag{16}$$

for $i \in \{1, \ldots, n\}$. As before, these agent decision rules depend on $n$ and satisfy
the same constraints imposed on the decision rules in Section V.

A central authority fusion rule consists of a sequence of functions $\{\hat\eta_n\}_{n=1}^\infty$
mapping $\mathbb{R}^d \times \{0, 1\}^n$ to $\mathcal{Y} = \mathbb{R}$. To proceed, we require some regularity on
$\{\hat\eta_n\}_{n=1}^\infty$. Namely, let us consider all fusion rules for which there exists a constant
$C$ such that

$$|\hat\eta_n(x, b_1) - \hat\eta_n(x, b_2)| \le C \left| \frac{1}{n} \sum_{i=1}^n b_{1i} - \frac{1}{n} \sum_{i=1}^n b_{2i} \right| \tag{17}$$

for all bit strings $b_1, b_2 \in \{0, 1\}^n$, all $x \in \mathbb{R}^d$ and every $n$. This condition
essentially amounts to a type of Lipschitz continuity and implies that the fusion
rule is invariant to the permutation of the bits it receives from the agents.
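Condition (17) can be checked by brute force for small $n$. The sketch below is illustrative: `scaled_vote` is a hypothetical fusion rule with a fixed scale `c` (so that a single constant $C = 2c$ works for every $n$), `first_bit_only` is a hypothetical rule that violates the condition, and the tolerance guards against floating-point rounding.

```python
import itertools


def scaled_vote(x, bits, c=1.0):
    """Hypothetical fusion rule: shift and scale the fraction of ones (ignores x)."""
    return 2.0 * c * (sum(bits) / len(bits) - 0.5)


def first_bit_only(x, bits):
    """Hypothetical rule that listens to one agent only; not permutation invariant."""
    return float(bits[0])


def satisfies_17(fusion, n, C, x=0.0, tol=1e-12):
    """Brute-force check of condition (17) over all pairs of n-bit strings."""
    strings = list(itertools.product((0, 1), repeat=n))
    for b1 in strings:
        for b2 in strings:
            lhs = abs(fusion(x, b1) - fusion(x, b2))
            rhs = C * abs(sum(b1) / n - sum(b2) / n)
            if lhs > rhs + tol:
                return False
    return True


ok_vote = satisfies_17(scaled_vote, n=6, C=2.0)
ok_first = satisfies_17(first_bit_only, n=6, C=100.0)
print(ok_vote, ok_first)
```

The second rule fails for any $C$ because two bit strings with the same number of ones make the right-hand side of (17) zero, which is exactly the permutation invariance noted above.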

For any chosen agent decision rule and central authority fusion rule, the $L^2$-risk
is the performance metric of choice. Specifically, we will consider $E\{|\hat\eta_n(X) - \eta(X)|^2\}$.
As before, the expectation is taken over $X$, $D_n = \{(X_i, Y_i)\}_{i=1}^n$, and
any randomness introduced in the agent decision rules themselves.

6.1 Main Result 

Assuming an ensemble using the decision rules satisfying the fairly natural
constraints stated above, Proposition 3 specifies a negative result.

Proposition 3. For every sequence of agent decision rules $\{\delta_{ni}(x)\}_{n=1}^\infty$ specified
according to (16) with a converging sequence of functions $\{a_n\}_{n=1}^\infty \subset \mathcal{A}$, there is
no combining rule $\{\hat\eta_n\}_{n=1}^\infty$ satisfying (17) such that

$$\lim_{n \to \infty} E\{|\hat\eta_n(X, \{\delta_{ni}(X)\}_{i=1}^n) - \eta(X)|^2\} = 0 \tag{18}$$

for every distribution $P_{XY}$.

6.2 Proof of Proposition 3 

The proof will proceed by specifying two random variables $(X, Y)$ and $(X', Y')$
with $\eta(x) = E\{Y \mid X = x\} \ne E\{Y' \mid X' = x\} = \eta'(x)$. Asymptotically, however,
the central authority’s estimate will be indifferent to whether the agents
are trained with random data distributed according to $P_{XY}$ or $P_{X'Y'}$. This
observation will contradict universal consistency and complete the proof.

Proof. To start, fix a convergent sequence of functions $\{a_n\}_{n=1}^\infty \subset \mathcal{A}$, arbitrary
$x_0, x_1 \in \mathbb{R}^d$, and distinct $y_0, y_1 \in \mathbb{R}$. Let us specify a distribution $P_{XY}$. Let
$P_X\{x_0\} = q$, $P_X\{x_1\} = 1 - q$, and $P_{Y|X}\{Y = y_i \mid X = x_i\} = 1$ for $i = 0, 1$.
Clearly, for this distribution $\eta(x_i) = y_i$ for $i = 0, 1$.

Suppose that the ensemble is trained with random data distributed according
to $(X, Y)$ and that the central authority wishes to classify $X = x_0$. According to
the model, after broadcasting $X$ to the agents, the central authority will observe
a random sequence of $n$ bits. For all $i \in \{1, \ldots, n\}$ and all $n$,

$$P\{\delta_{ni}(X, X_i, Y_i) = 1 \mid X = x_0\} = a_n(x_0, x_0, y_0)\, q + a_n(x_0, x_1, y_1)(1 - q).$$

Define a sequence of auxiliary random variables, $\{(X'_n, Y'_n)\}_{n=1}^\infty$, with
distributions satisfying

$$P_{X'_n}\{x_0\} = \frac{a_n(x_0, x_0, y_0)\, q + a_n(x_0, x_1, y_1)(1 - q) - a_n(x_0, x_1, y_0)}{a_n(x_0, x_0, y_1) - a_n(x_0, x_1, y_0)}$$
$$P_{X'_n}\{x_1\} = 1 - P_{X'_n}\{x_0\}$$
$$P_{Y'_n | X'_n}\{Y'_n = y_{1-i} \mid X'_n = x_i\} = 1, \quad i = 0, 1.$$
Here, $\eta'(x_i) = E\{Y'_n \mid X'_n = x_i\} = y_{1-i}$. Note that if the ensemble were trained
with random data distributed according to $P_{X'_n Y'_n}$, then we would have

$$\begin{aligned}
P\{\delta_{ni}(X'_n, X'_{ni}, Y'_{ni}) = 1 \mid X'_n = x_0\}
&= a_n(x_0, x_0, y_1)\, P_{X'_n}\{x_0\} + a_n(x_0, x_1, y_0) \left( 1 - P_{X'_n}\{x_0\} \right) \\
&= a_n(x_0, x_0, y_0)\, q + a_n(x_0, x_1, y_1)(1 - q) \\
&= P\{\delta_{ni}(X, X_i, Y_i) = 1 \mid X = x_0\},
\end{aligned}$$

for all $n$. Thus, conditioned on $X$ and $X'_n$, the central authority will observe an
identical stochastic process regardless of whether the ensemble was trained with
data distributed according to $P_{XY}$ or $P_{X'_n Y'_n}$ for any fixed $n$. Note, this is true
despite the fact that $\eta(x) \ne \eta'(x)$. Finally, let $(X', Y')$ be such that

$$P_{X'}\{x_1\} = \lim_{n \to \infty} P_{X'_n}\{x_1\}$$
$$P_{X'}\{x_0\} = 1 - P_{X'}\{x_1\}$$
$$P_{Y'|X'}\{Y' = y_{1-i} \mid X' = x_i\} = 1, \quad i = 0, 1.$$

Again, $\eta'(x_i) = E\{Y' \mid X' = x_i\} = y_{1-i}$. By definition, for the ensemble to
be universally consistent, both $E\{|\hat\eta_n(X) - \eta(X)|^2\} \to 0$ and
$E\{|\hat\eta_n(X') - \eta'(X')|^2\} \to 0$. However, assuming the former holds, we can show that
necessarily, $E\{|\hat\eta_n(X') - \eta(X')|^2\} \to 0$. Since $\eta(x) \ne \eta'(x)$, this presents a contradiction
and completes the proof; the details are left for the full paper. □
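The indistinguishability at the heart of this argument can be verified numerically for one fixed $n$. The sketch below is illustrative only: the points `x0`, `x1`, the labels `y0`, `y1`, the mass `q`, and the table of values for $a_n(x_0, \cdot, \cdot)$ are hypothetical, chosen so that the auxiliary mass on $x_0$ is a valid probability.

```python
x0, x1 = 0.0, 1.0          # hypothetical feature points
y0, y1 = 0.0, 1.0          # hypothetical, distinct labels
q = 0.4                    # P_X{x0} under the original distribution

# Hypothetical values of a_n(x0, x_i, y_j) for one fixed n, each in [0, 1].
A = {(x0, y0): 0.3, (x1, y1): 0.6, (x0, y1): 0.7, (x1, y0): 0.2}


def a_n(x, x_i, y_i):
    """Probability that an agent holding (x_i, y_i) sends 1 when queried at x = x0."""
    return A[(x_i, y_i)]


def response_prob(dist):
    """P{agent sends 1 | query x0} when training pairs are drawn from dist."""
    return sum(p * a_n(x0, x_i, y_i) for (x_i, y_i), p in dist.items())


# Original distribution: label y_i is attached to x_i, so eta(x0) = y0.
P = {(x0, y0): q, (x1, y1): 1.0 - q}
p = response_prob(P)

# Auxiliary distribution: labels swapped, so eta'(x0) = y1, with mass on x0 tuned
# so that each agent's response probability is unchanged.
q_prime = (p - a_n(x0, x1, y0)) / (a_n(x0, x0, y1) - a_n(x0, x1, y0))
P_prime = {(x0, y1): q_prime, (x1, y0): 1.0 - q_prime}
p_prime = response_prob(P_prime)

print(p, p_prime, q_prime)
```

Because every agent's bit is Bernoulli with the same parameter under both distributions, no fusion rule can tell the two training distributions apart from the bits alone, even though the regression functions differ at $x_0$.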
A Technical Lemmas

The following lemmas can be found in various forms in [2], [3], and [16].

Lemma 1. Suppose $\{X_n\}_{n=1}^\infty$ is a sequence of random variables such that $X_n \to X$
i.p. Then, for all $\varepsilon > 0$ and any sequence of events $\{A_n\}_{n=1}^\infty$ with $\liminf P\{A_n\} > 0$,

$$P\{|X_n - X| > \varepsilon \mid A_n\} \to 0.$$

Lemma 2. Fix an $\mathbb{R}^d$-valued random variable $X$ and a measurable function $f$.
For an arbitrary sequence of real numbers $\{r_n\}_{n=1}^\infty$, define a sequence of functions
$f_n(x) = E\{f(X) \mid X \in B_{r_n}(x)\}$. If $r_n \to 0$, then $f_n(X) \to f(X)$ in probability.

Lemma 3. Suppose $X$ is an $\mathbb{R}^d$-valued random variable and $\{r_n\}_{n=1}^\infty$ and
$\{a_n\}_{n=1}^\infty$ are sequences of real numbers with $r_n \to 0$ and $a_n \to \infty$. If $r_n^d\, a_n \to \infty$,
then

$$a_n \int 1_{B_{r_n}(X)}(y)\, P_X(dy) \to \infty \quad \text{i.p.}$$

Lemma 4. There is a constant $c$ such that for all $n$ and any nonnegative measurable function $f$,

$$E\left\{ \frac{\sum_{i=1}^n 1_{\{X_i \in B_{r_n}(X)\}}\, f(X_i)}{\sum_{i=1}^n 1_{\{X_i \in B_{r_n}(X)\}}} \right\} \le c\, E\{f(X)\}.$$
Abstract. Given a set of n randomly drawn sample points, spectral
clustering in its simplest form uses the second eigenvector of the graph 
Laplacian matrix, constructed on the similarity graph between the sam- 
ple points, to obtain a partition of the sample. We are interested in the 
question how spectral clustering behaves for growing sample size n. In 
case one uses the normalized graph Laplacian, we show that spectral clus- 
tering usually converges to an intuitively appealing limit partition of the 
data space. We argue that in case of the unnormalized graph Laplacian, 
equally strong convergence results are difficult to obtain. 



1 Introduction 

Clustering is a widely used technique in machine learning. Given a set of data 
points, one is interested in partitioning the data based on a certain similarity 
among the data points. If we assume that the data is drawn from some underly- 
ing probability distribution, which often seems to be the natural mathematical 
framework, the goal becomes to partition the probability space into certain re- 
gions with high similarity among points. In this setting the problem of clustering 
is twofold:

— Assuming that the underlying probability distribution is known, what is a 
desirable clustering of the data space? 

— Given finitely many data points sampled from an unknown probability dis- 
tribution, how can we reconstruct that optimal partition empirically on the 
finite sample? 

Interestingly, while extensive literature exists on clustering and partitioning, to 
the best of our knowledge very few algorithms have been analyzed or shown to 
converge for increasing sample size. Some exceptions are the k-means algorithm 
(cf. Pollard, 1981), the single linkage algorithm (cf. Hartigan, 1981), and the 
clustering algorithm suggested by Niyogi and Karmarkar (2000). The goal of 
this paper is to investigate the limit behavior of a class of spectral clustering 
algorithms. 



J. Shawe-Taylor and Y. Singer (Eds.): COLT 2004, LNAI 3120, pp. 457—471, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004
Spectral clustering is a popular technique going back to
Donath and Hoffman (1973) and Fiedler (1973). It has been used for
load balancing (Van Driessche and Roose, 1995), parallel computations
(Hendrickson and Leland, 1995), and VLSI design (Hagen and Kahng, 1992).
Recently, Laplacian-based clustering algorithms have found success in
applications to image segmentation (cf. Shi and Malik, 2000). Methods based
on graph Laplacians have also been used for other problems in machine
learning, including semi-supervised learning (cf. Belkin and Niyogi, to appear;
Zhu et al., 2003). While theoretical properties of spectral clustering have been
studied (e.g., Guattery and Miller (1998), Weiss (1999), Kannan et al. (2000),
Meila and Shi (2001); see also Chung (1997) for a comprehensive theoretical
treatment of spectral graph theory), we do not know of any results
discussing the convergence of spectral clustering or the spectra of graph
Laplacians for increasing sample size. For kernel matrices, however, the
convergence of the eigenvalues and eigenvectors has already attracted some
attention (cf. Williams and Seeger, 2000; Shawe-Taylor et al., 2002;
Bengio et al., 2003).

2 Background and Notations 

Let $(\mathcal{X}, dist)$ be a metric space, $\mathcal{B}$ the Borel
$\sigma$-algebra on $\mathcal{X}$, $P$ a probability measure on
$(\mathcal{X}, \mathcal{B})$, and $L_2(P) := L_2(\mathcal{X}, \mathcal{B}, P)$
the space of square-integrable functions. Let
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a measurable, symmetric,
non-negative function that computes the similarity between points in
$\mathcal{X}$. For given sample points $X_1, \dots, X_n$ drawn iid according to
the (unknown) distribution $P$ we denote the empirical distribution by $P_n$.
We define the similarity matrix as $K_n := (k(X_i, X_j))_{i,j=1,\dots,n}$ and
the degree matrix $D_n$ as the diagonal matrix with diagonal entries
$d_i := \sum_{j=1}^{n} k(X_i, X_j)$. The unnormalized discrete Laplacian matrix
is defined as $L_n := D_n - K_n$. For symmetric and non-negative $k$, $L_n$ is
a positive semi-definite linear operator on $\mathbb{R}^n$. Let
$a = (a_1, \dots, a_n)$ be the second eigenvector of $L_n$. Here, "second
eigenvector" refers to the eigenvector belonging to the second smallest
eigenvalue, where the eigenvalues $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$
are counted with multiplicity. In a nutshell, spectral clustering in its
simplest form partitions the sample points $(X_i)_i$ into two (or several)
groups by thresholding the second eigenvector of $L_n$: point $X_i$ belongs to
cluster 1 if $a_i > b$, and to cluster 2 otherwise, where $b \in \mathbb{R}$ is
some appropriate constant. An intuitive explanation of why this works is
discussed in Section 4.
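
The procedure just described can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the Gaussian similarity with width t = 4, the threshold b = 0, the toy two-cluster data, and the function names are all choices made here, not prescribed by the text.

```python
import numpy as np

def gaussian_similarity(X, t=4.0):
    """Similarity matrix K_n with entries exp(-||x_i - x_j||^2 / t) (assumed kernel)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / t)

def unnormalized_spectral_bipartition(K, b=0.0):
    """Threshold the second eigenvector of L_n = D_n - K_n at level b."""
    D = np.diag(K.sum(axis=1))             # degree matrix D_n
    L = D - K                              # unnormalized Laplacian L_n
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    a = eigvecs[:, 1]                      # eigenvector of the second smallest eigenvalue
    return (a > b).astype(int), eigvals

# Toy sample: two groups of 20 points around (-2, -2) and (2, 2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.3, size=(20, 2)),
               rng.normal(2.0, 0.3, size=(20, 2))])
labels, eigvals = unnormalized_spectral_bipartition(gaussian_similarity(X))
```

Which of the two groups receives label 1 depends on the arbitrary sign of the eigenvector returned by the solver.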

Often, spectral clustering is also performed with a normalized version of the
matrix $L_n$. Two common ways of normalizing are
$L'_n := D_n^{-1/2} L_n D_n^{-1/2}$ or $L''_n := D_n^{-1} L_n$. The eigenvalues
and eigenvectors of both matrices are closely related. Define the normalized
similarity matrices $H'_n := D_n^{-1/2} K_n D_n^{-1/2}$ and
$H''_n := D_n^{-1} K_n$. It can be seen by multiplying the eigenvalue equation
$L'_n v = \lambda v$ from the left with $D_n^{-1/2}$ that $v \in \mathbb{R}^n$
is an eigenvector of $L'_n$ with eigenvalue $\lambda$ iff $D_n^{-1/2} v$ is an
eigenvector of $L''_n$ with eigenvalue $\lambda$. Furthermore, rearranging the
eigenvalue equations for $L'_n$ and $L''_n$ shows that $v \in \mathbb{R}^n$ is
an eigenvector of $L'_n$ with eigenvalue $\lambda$ iff $v$ is an eigenvector of
$H'_n$ with eigenvalue $(1 - \lambda)$, and that $v \in \mathbb{R}^n$ is an
eigenvector of $L''_n$ with eigenvalue $\lambda$ iff $v$ is an eigenvector of
$H''_n$ with eigenvalue $(1 - \lambda)$. Thus, properties of the spectrum of
one of the matrices $L'_n$, $L''_n$, $H'_n$, or $H''_n$ can be reformulated for
the three other matrices as well.
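
These relations are easy to confirm numerically. The following sketch checks them on a small random symmetric similarity matrix; the matrix size, the random seed, and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 6))
K = (A + A.T) / 2                        # symmetric similarity matrix with positive entries
d = K.sum(axis=1)                        # degrees d_i
L = np.diag(d) - K                       # L_n = D_n - K_n

L_sym = L / np.sqrt(np.outer(d, d))      # L'_n  = D^{-1/2} L_n D^{-1/2}
L_rw = L / d[:, None]                    # L''_n = D^{-1} L_n
H_sym = K / np.sqrt(np.outer(d, d))      # H'_n  = D^{-1/2} K_n D^{-1/2}
H_rw = K / d[:, None]                    # H''_n = D^{-1} K_n

lam, V = np.linalg.eigh(L_sym)           # eigenpairs of L'_n
mu = np.linalg.eigvalsh(H_sym)           # eigenvalues of H'_n, ascending

# lambda is an eigenvalue of L'_n iff 1 - lambda is an eigenvalue of H'_n.
spectra_match = np.allclose(np.sort(1.0 - lam), mu)

# D^{-1/2} v is an eigenvector of L''_n with the same eigenvalue.
W = V / np.sqrt(d)[:, None]
residual = np.abs(L_rw @ W - W * lam).max()

# H''_n has the same eigenvalues as H'_n.
rw_match = np.allclose(np.sort(np.linalg.eigvals(H_rw).real), mu)
```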

In the following we recall some definitions and facts from perturbation
theory for bounded operators. The standard reference for general
perturbation theory is Kato (1966); for perturbation theory in Hilbert spaces
we also recommend Birman and Solomjak (1987) and Weidmann (1980), and
Bhatia (1997) for finite-dimensional perturbation theory. We denote by
$\sigma(T)$ the spectrum of a linear operator $T$. Its essential and discrete
spectra are denoted by $\sigma_{ess}(T)$ and $\sigma_d(T)$, respectively.

Proposition 1 (Spectral and perturbation theory). 

1. Spectrum of a compact operator: Let $T$ be a compact operator on a Banach
space. Then $\sigma(T)$ is at most countable and has at most one limit point,
namely 0. If $0 \ne \lambda \in \sigma(T)$, then $\lambda$ is an isolated
eigenvalue with finite multiplicity. The spectral projection corresponding to
$\lambda$ coincides with the projection on the corresponding eigenspace.

2. Spectrum of a multiplication operator: For a bounded function
$g \in L_\infty(P)$ consider the multiplication operator
$M_g : L_2(P) \to L_2(P)$, $f \mapsto gf$. $M_g$ is a bounded linear operator
whose spectrum coincides with the essential range of the multiplier $g$.

3. Perturbation of symmetric matrices: Let $A$ and $B$ be two symmetric
matrices in $\mathbb{R}^{n \times n}$, and denote by $\|\cdot\|$ an operator
norm on $\mathbb{R}^{n \times n}$. Then the Hausdorff distance
$d(\sigma(A), \sigma(B))$ between the two spectra satisfies
$d(\sigma(A), \sigma(B)) \le \|A - B\|$. Let $\mu_1 > \dots > \mu_k$ be the
eigenvalues of $A$ counted without multiplicity and $Pr_1, \dots, Pr_k$ the
projections on the corresponding eigenspaces. For $1 \le r \le k$ define the
numbers

$\gamma_r(A) := \min\{ |\mu_i - \mu_j| ;\ 1 \le i < j \le r + 1 \}$.

Assume that $\|B\| \le \varepsilon$. Then for all $1 \le l \le r$ we have

$\| Pr_l(A + B) - Pr_l(A) \| \le 4 \frac{\varepsilon}{\gamma_r(A)}$

(cf. Section VI.3 of Bhatia, 1997, Lemma A.1.(iii) of Koltchinskii, 1998,
and Lemma 5.2. of Koltchinskii and Gine, 2000).

4. Perturbation of bounded operators: Let $(T_n)_n$ and $T$ be bounded
operators on a Banach space $E$ with $T_n \to T$ in operator norm, and
$\lambda$ an isolated eigenvalue of $T$ with finite multiplicity. Then, for $n$
large enough, there exist isolated eigenvalues $\lambda_n \in \sigma(T_n)$ such
that $\lambda_n \to \lambda$, and the corresponding spectral projections
converge in operator norm. Conversely, for a converging sequence
$\lambda_n \in \sigma(T_n)$ of isolated eigenvalues with finite multiplicity,
there exists an isolated eigenvalue $\lambda \in \sigma(T)$ with finite
multiplicity such that $\lambda_n \to \lambda$ and the corresponding spectral
projections converge in operator norm (cf. Theorems 3.16 and 2.23 in
Kato, 1966).

5. Perturbation of the essential spectrum: Let $A$ be a bounded and $V$ a
compact operator on some Banach space. Then
$\sigma_{ess}(A + V) = \sigma_{ess}(A)$ (cf. Th. 5.35 in Kato, 1966, and
Th. 9.1.3 in Birman and Solomjak, 1987).

Finally we will need the following definition. A set $\mathcal{F}$ of
real-valued functions on $\mathcal{X}$ is called a $P$-Glivenko-Cantelli class
if

$\sup_{f \in \mathcal{F}} \left| \int f \, dP_n - \int f \, dP \right| \to 0$ $P$-a.s.

3 Convergence of the Normalized Laplacian 

The goal of this section is to prove that the first eigenvectors of the normalized
Laplacian converge to the eigenfunctions of some limit operator on $L_2(P)$.

3.1 Definition of the Integral Operators 

Let $d(x) := \int k(x,y) \, dP(y)$ be the "true degree function" on
$\mathcal{X}$, and $d_n(x) := \int k(x,y) \, dP_n(y)$ the empirical degree
function. To ensure that $1/d$ is a bounded function we assume that there
exists some constant $l$ such that $d(x) \ge l > 0$ for all
$x \in \mathcal{X}$. We define the normalized similarity functions

$h_n(x,y) := k(x,y) / \sqrt{d_n(x) d_n(y)}$

$h(x,y) := k(x,y) / \sqrt{d(x) d(y)}$   (1)

and the operators

$T_n : L_2(P_n) \to L_2(P_n)$, $T_n f(x) = \int h(x,y) f(y) \, dP_n(y)$

$T'_n : L_2(P_n) \to L_2(P_n)$, $T'_n f(x) = \int h_n(x,y) f(y) \, dP_n(y)$

$T : L_2(P) \to L_2(P)$, $T f(x) = \int h(x,y) f(y) \, dP(y)$.   (2)

If $k$ is bounded and $d \ge l > 0$, then all three operators are bounded,
compact integral operators. Note that the scaling factors $1/n$ which are
hidden in $d_n$ and $P_n$ cancel. Hence, because of the isomorphism between
$L_2(P_n)$ and $\mathbb{R}^n$, the eigenvalues and eigenvectors of $T'_n$ can
be identified with the ones of the empirical similarity matrix $H'_n$, and the
eigenvectors and eigenvalues of $T_n$ with those of the matrix
$H_n := (h(X_i, X_j))_{i,j=1,\dots,n}$.

Our goal in the following will be to show that the eigenvectors of $H'_n$
converge to those of the integral operator $T$. The first step will consist in
proving that the operators $T_n$ and $T'_n$ converge to each other in operator
norm. By perturbation theory results this will allow us to conclude that their
spectra also become similar. The second step is to show that the eigenvalues
and eigenvectors of $T_n$ converge to those of $T$. This step uses results
obtained in Koltchinskii (1998). Both steps together then will show that the
first eigenvectors of the normalized Laplacian matrix converge to the first
eigenfunctions of the limit operator $T$, and hence that spectral clustering
converges.
3.2 $T_n$ and $T'_n$ Converge to Each Other

Proposition 2 ($d_n$ converges to $d$ uniformly on the sample). Let
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be bounded. Then
$\max_{i=1,\dots,n} |d_n(X_i) - d(X_i)| \to 0$ a.s. for $n \to \infty$.

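Proof. With $M := \|k\|_\infty < \infty$ we have

$\max_{i=1,\dots,n} |d_n(X_i) - d(X_i)| = \max_{i=1,\dots,n} \left| \frac{1}{n} \sum_{j=1}^{n} k(X_i, X_j) - E_X k(X_i, X) \right|$

$\le \frac{2M}{n} + \frac{n-1}{n} \max_{i=1,\dots,n} \left| \frac{1}{n-1} \sum_{j \ne i} k(X_i, X_j) - E_X k(X_i, X) \right|.$

For fixed $x \in \mathcal{X}$, the Hoeffding inequality yields

$P\left( \left| \frac{1}{n-1} \sum_{j \ne i} k(x, X_j) - E_X k(x, X) \right| > \varepsilon \right) \le \exp(-M(n-1)\varepsilon^2).$

The same is true conditionally on $X_i$ if we replace $x$ by $X_i$, because the
random variable $X_i$ is independent of $X_j$ for $j \ne i$. Applying the union
bound and taking expectations over $X_i$ leads to

$P\left( \max_{i=1,\dots,n} \left| \frac{1}{n-1} \sum_{j \ne i} k(X_i, X_j) - E_X k(X_i, X) \right| > \varepsilon \right)$

$\le \sum_{i=1}^{n} P\left( \left| \frac{1}{n-1} \sum_{j \ne i} k(X_i, X_j) - E_X k(X_i, X) \right| > \varepsilon \right)$

$\le n \exp(-M(n-1)\varepsilon^2).$

This shows the convergence of $\max_{i=1,\dots,n} |d_n(X_i) - d(X_i)| \to 0$ in
probability. As the deviations decrease exponentially, the Borel-Cantelli
lemma shows that this convergence also holds almost surely. □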
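
The statement of Proposition 2 can be observed in a small simulation. The sketch below is an illustration only, under assumptions chosen here because they give a closed-form degree function: P is the uniform distribution on [0, 1] and k(x, y) = exp(-|x - y|), so that d(x) = 2 - exp(-x) - exp(-(1 - x)). The function name is hypothetical.

```python
import numpy as np

def max_degree_deviation(n, rng):
    """max_i |d_n(X_i) - d(X_i)| for X ~ Uniform[0, 1] and k(x, y) = exp(-|x - y|)."""
    X = rng.random(n)
    K = np.exp(-np.abs(X[:, None] - X[None, :]))
    d_n = K.mean(axis=1)                           # empirical degree d_n(X_i)
    d = 2.0 - np.exp(-X) - np.exp(-(1.0 - X))      # closed-form true degree d(X_i)
    return np.abs(d_n - d).max()

rng = np.random.default_rng(0)
deviations = {n: max_degree_deviation(n, rng) for n in (100, 400, 1600)}
```

Printing `deviations` shows how the maximal deviation on the sample behaves as n grows; individual runs fluctuate, so the values need not decrease monotonically.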



Proposition 3 ($\|T'_n - T_n\|_{L_2(P_n)}$ converges to 0). Let $k$ be a bounded
similarity function. Assume that there exist constants $u > l > 0$ such that
$u \ge d(x) \ge l > 0$ for all $x \in \mathcal{X}$. Then both
$\|T_n - T'_n\|_{L_2(P_n)} \to 0$ and $\|H_n - H'_n\|_n \to 0$ a.s., where
$\|\cdot\|_n$ denotes the row sum norm for $n \times n$ matrices.



Proof. By the Cauchy-Schwarz inequality,

$\|T_n - T'_n\|^2_{L_2(P_n)} = \sup_{\|f\|_{L_2(P_n)} \le 1} \int \left( \int (h_n(x,y) - h(x,y)) f(y) \, dP_n(y) \right)^2 dP_n(x)$

$\le \sup_{\|f\|_{L_2(P_n)} \le 1} \int \left( \int (h_n(x,y) - h(x,y))^2 \, dP_n(y) \int f^2(y) \, dP_n(y) \right) dP_n(x)$

$\le \int \int (h_n(x,y) - h(x,y))^2 \, dP_n(y) \, dP_n(x)$

$\le \max_{i,j=1,\dots,n} |h_n(X_i, X_j) - h(X_i, X_j)|^2.$

By Proposition 2 we know that for each $\varepsilon > 0$ there exists some $N$
such that for all $n > N$, $|d_n(x) - d(x)| \le \varepsilon$ for all
$x \in \{X_1, \dots, X_n\}$. Then

$|d_n(x) d_n(y) - d(x) d(y)| \le |d_n(x) d_n(y) - d(x) d_n(y)| + |d(x) d_n(y) - d(x) d(y)| \le 2u\varepsilon,$

which implies that
$|\sqrt{d_n(x) d_n(y)} - \sqrt{d(x) d(y)}| \le \sqrt{2u\varepsilon}$. This
finally leads to

$\left| \frac{1}{\sqrt{d_n(x) d_n(y)}} - \frac{1}{\sqrt{d(x) d(y)}} \right| = \left| \frac{\sqrt{d_n(x) d_n(y)} - \sqrt{d(x) d(y)}}{\sqrt{d_n(x) d_n(y)} \sqrt{d(x) d(y)}} \right| \le \frac{\sqrt{2u\varepsilon}}{l(l - 2u\varepsilon)}$

for all $x, y \in \{X_1, \dots, X_n\}$. This shows that $\|T_n - T'_n\|$
converges to 0 almost surely. The statement for $\|H_n - H'_n\|$ follows by a
similar argument. □



3.3 Convergence of $T_n$ to $T$

Now we want to deal with the convergence of $T_n$ to $T$. By the law of large
numbers it is clear that $T_n f(x) \to T f(x)$ for all $x \in \mathcal{X}$ and
$f \in \mathcal{F}$. But this pointwise convergence is not enough to allow any
conclusion about the convergence of the eigenvalues, let alone the
eigenfunctions of the involved operators. On the other hand, the best
convergence statement we can possibly think of would be convergence of $T_n$
to $T$ in operator norm. Here we have the problem that the operators $T_n$ and
$T$ are not defined on the same spaces. One way to handle this is to relate
the operators $T_n$, which are currently defined on $L_2(P_n)$, to some
operators $S_n$ on the space $L_2(P)$ such that their spectra are preserved.
Then we would have to prove that $S_n$ converges to $T$ in operator norm. We
believe that such a statement cannot be true in general. Intuitively, the
reason for this is the following. Convergence in operator norm means uniform
convergence on the unit ball of $L_2(P)$. Independent of the exact definition
of $S_n$, the convergence of $S_n$ to $T$ in operator norm is closely related
to the problem

$\sup_{\|f\| \le 1} \left\| \int k(x,y) f(y) \, dP_n(y) - \int k(x,y) f(y) \, dP(y) \right\| \to 0.$

This statement would be true if the class
$\mathcal{G} := \{ k(x, \cdot) f(\cdot) ;\ x \in \mathcal{X}, \|f\| \le 1 \}$
was a $P$-Glivenko-Cantelli class, which is false in general. This can be made
plausible by considering the special case $k \equiv 1$. Then the condition
would be that the unit ball of $L_2(P)$ is a Glivenko-Cantelli class, which is
clearly not the case for large enough $\mathcal{X}$. As a consequence, we
cannot hope to achieve uniform convergence over the unit ball of $L_2(P)$.

A way out of this problem might be not to consider uniform convergence
on the whole unit ball, but on a smaller subset of it. Something of a similar
flavor has been proved in Koltchinskii (1998). To state his results we first
have to introduce some more notation. For a function
$f : \mathcal{X} \to \mathbb{R}$ denote its restriction to the sample points by
$\tilde f$. Let $h : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a
symmetric, measurable similarity function such that $E(h^2(X,Y)) < \infty$.
This condition implies that the integral operator $T$ with kernel $h$ is a
Hilbert-Schmidt operator. Let $(\lambda_i)_{i \in I}$ be its eigenvalues and
$(\Phi_i)_{i \in I}$ a corresponding set of orthonormal eigenfunctions. To
measure the distance between two countable sets $A = (a_i)_{i \in \mathbb{N}}$,
$B = (b_i)_{i \in \mathbb{N}}$, we introduce the minimal matching distance
$\delta(A, B) := \inf_\pi \sum_{i=1}^{\infty} |a_i - b_{\pi(i)}|$, where the
infimum is taken over the set of all permutations $\pi$ of $\mathbb{N}$. A
more general version of the following theorem has been proved in
Koltchinskii (1998).

Theorem 4 (Koltchinskii). Let $(\mathcal{X}, \mathcal{B}, P)$ be an arbitrary
probability space, $h : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a
symmetric, measurable function such that $E(h^2(X,Y)) < \infty$ and
$E(|h(X,X)|) < \infty$, and $T_n$ and $T$ the integral operators as defined in
equation (2). Let $(\Phi_i)_{i \in I}$ be the eigenfunctions of $T$, and let
$\lambda \ne 0$ be the $r$-th largest eigenvalue of $T$ (counted without
multiplicity). Denote by $Pr_r$ and $Pr_{r,n}$ the projections on the
eigenspaces corresponding to the $r$-th largest eigenvalues of $T$ and $T_n$,
respectively. Then:

1. $\delta(\sigma(T_n), \sigma(T)) \to 0$ a.s.

2. Suppose that $\mathcal{G}$ is a class of measurable functions on
$\mathcal{X}$ with a square-integrable envelope $G$ with
$\|G\|_{L_2(P)} \le 1$, i.e., $|g(x)| \le G(x)$ for all $g \in \mathcal{G}$.
Moreover, suppose that for all $i \in I$, the set
$\mathcal{G}\Phi_i := \{ g \Phi_i ;\ g \in \mathcal{G} \}$ is a
$P$-Glivenko-Cantelli class. Then

$\sup_{f, g \in \mathcal{G}} \left| \langle Pr_{r,n} \tilde f, \tilde g \rangle_{L_2(P_n)} - \langle Pr_r f, g \rangle_{L_2(P)} \right| \to 0$ a.s. for $n \to \infty$.



Coming back to the discussion from above, we can see that this theorem
also does not state convergence of the spectral projections uniformly on the
whole unit ball of $L_2(P)$, but only on some subset $\mathcal{G}$ of it. The
problem that the operators $T_n$ and $T$ are not defined on the same space has
been circumvented by considering bilinear forms instead of the operators
themselves.



3.4 Convergence of the Second Eigenvectors 

Now we have collected all ingredients to discuss the convergence of the second
largest eigenvalue and eigenvector of the normalized Laplacian. Talking about
convergence of eigenvectors only makes sense if the eigenspaces of the
corresponding eigenvalues are one-dimensional. Otherwise there exist many
different eigenvectors for the same eigenvalue. So multiplicity one is the
assumption we make in our main result. In order to compare an eigenvector of
the discrete operator $T'_n$ and the corresponding eigenfunction of $T$, we
can only measure how distinct they are on the points of the sample, that is,
by the $L_2(P_n)$-distance. However, as eigenvectors are only unique up to
changing their orientations, we will compare them only up to a change of sign.

Theorem 5 (Convergence of normalized spectral clustering). Let
$(\mathcal{X}, \mathcal{B}, P)$ be a probability space,
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a symmetric, bounded,
measurable function, and $(X_i)_{i \in \mathbb{N}}$ a sequence of data points
drawn iid from $\mathcal{X}$ according to $P$. Assume that the degree function
satisfies $d(x) > l > 0$ for all $x \in \mathcal{X}$ and some constant
$l \in \mathbb{R}$. Denote by $\lambda \ne 0$ the second largest eigenvalue of
$T$ (counted with multiplicity), and assume that it has multiplicity one. Let
$\Phi$ be the corresponding eigenfunction, and $Pr$ the projection on $\Phi$.
Let $\lambda_n$, $\Phi_n$ and $Pr_n$ be the same quantities for $T_n$, and
$\lambda'_n$, $\Phi'_n$ and $Pr'_n$ the same for $T'_n$. Then there exists a
sequence of signs $(a_n)_n$ with $a_n \in \{-1, +1\}$ such that
$\|a_n \Phi'_n - \Phi\|_{L_2(P_n)} \to 0$ almost surely.

Proof. The boundedness of $k$ and $d(x) > l > 0$ imply that the normalized
similarity function $h$ is bounded. Hence, the operators $T$, $T_n$ and $T'_n$
are compact operators. By Proposition 1.1, their non-zero eigenvalues are
isolated in their spectra, and their spectral projections correspond to the
projections on the eigenspaces. Moreover, the boundedness of $h$ implies
$E(h^2(X,Y)) < \infty$ and $E|h(X,X)| < \infty$. Theorem 4 shows
$\lambda_n \to \lambda$ for $n \to \infty$, and choosing
$\mathcal{G} = \{\Phi\}$ we get

$\langle \Phi_n, \Phi \rangle^2 = \langle \langle \Phi_n, \Phi \rangle \Phi_n, \Phi \rangle = \langle Pr_n \Phi, \Phi \rangle \to \langle Pr \, \Phi, \Phi \rangle = \langle \Phi, \Phi \rangle = 1.$

The eigenfunctions $\Phi$ and $\Phi_n$ are normalized to 1 in their respective
spaces. By the law of large numbers, we also have
$\|\Phi\|_{L_2(P_n)} \to 1$. Hence, $\langle \Phi_n, \Phi \rangle \to 1$ or
$-1$ implies the $L_2(P_n)$-convergence of $\Phi_n$ to $\Phi$ up to a change of
sign.

Now we have to compare $\lambda'_n$ to $\lambda_n$ and $\Phi'_n$ to $\Phi_n$.
In Proposition 3 we showed that $\|T'_n - T_n\| \to 0$ a.s., which according
to Proposition 1.3 implies the convergence of $\lambda'_n - \lambda_n$ to zero.
Theorem 4 implies the convergence of $\lambda_n - \lambda$ to zero. For the
convergence of the eigenfunctions, recall the definition of $\gamma_r$ in
Proposition 1.3. As the eigenvalues of $T$ are isolated we have
$\gamma_2(T) > 0$, and by the convergence of the eigenvalues we also get
$|\gamma_2(T'_n) - \gamma_2(T)| \to 0$. Hence, $\gamma_2(T'_n)$ is bounded away
from 0 simultaneously for all large $n$. Moreover, we know by Proposition 3
that $\|T'_n - T_n\| \to 0$ a.s. Proposition 1.3 now shows the convergence of
the spectral projections $\|Pr'_n - Pr_n\| \to 0$ a.s. This implies in
particular that

$\sup_{\|v\| \le 1} \langle v, (Pr_n - Pr'_n) v \rangle \to 0$ and thus $\sup_{\|v\| \le 1} \left| \langle v, \Phi_n \rangle^2 - \langle v, \Phi'_n \rangle^2 \right| \to 0.$

Since $|a^2 - b^2| = |a - b| \, |a + b|$, we get the convergence of $\Phi'_n$
to $\Phi_n$ up to a change of sign on the sample, as stated in the theorem.
This completes the proof. □

Let us briefly discuss the assumptions of Theorem 5. The symmetry of $k$ is
a standard requirement in spectral clustering as it ensures that all eigenvalues
of the Laplacian are real-valued. The assumption that the degree function
is bounded away from 0 prevents the normalized Laplacian from becoming
unbounded, which is also desirable in practice. This condition will often be
trivially satisfied as the second standard assumption of spectral clustering is
the non-negativity of $k$ (as it ensures that the eigenvalues of the Laplacian
are non-negative). An important assumption in Theorem 5 which is not
automatically satisfied is that the second eigenvalue has multiplicity one. But
note that if this assumption is not satisfied, spectral clustering will produce
more or less arbitrary results anyway, as the second eigenvector is no longer
unique. It then depends on the actual implementation of the algorithm which
of the infinitely many eigenvectors corresponding to the second eigenvalue
is picked, and the result will often be unsatisfactory. Finally, note that even
though Theorem 5 is stated in terms of the second eigenvalue and eigenvector,
analogous statements are true for higher eigenvalues, and also for spectral
projections on finite-dimensional eigenspaces with dimension larger than 1.

To summarize, all assumptions in Theorem 5 are already important for suc- 
cessful applications of spectral clustering on a finite sample. Theorem 5 now 
shows that with no additional assumptions, the convergence of normalized spec- 
tral clustering to a limit clustering on the whole data space is guaranteed. 
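
The comparison "up to a change of sign" in the $L_2(P_n)$-distance can be illustrated on a nested sample. The sketch below is not a verification of Theorem 5, since the limit eigenfunction is not available in closed form here; it only compares the second eigenvector of $H'_n$ on the first 200 points with the one computed on all 400 points. The one-dimensional two-component data, the Gaussian similarity with width t = 2, and the function names are assumptions of this example. For reproducibility the two mixture components are alternated deterministically, which is a simplification of iid sampling.

```python
import numpy as np

def second_eigenvector_normalized(X, t=2.0):
    """Second eigenvector of H'_n = D^{-1/2} K D^{-1/2}, scaled to unit L_2(P_n) norm."""
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / t)   # assumed Gaussian similarity
    d = K.sum(axis=1)
    H = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(H)              # ascending eigenvalues
    phi = eigvecs[:, -2]                              # second largest eigenvalue
    return phi * np.sqrt(len(X))                      # mean(phi^2) = 1

def distance_up_to_sign(u, v):
    """Minimum over a in {-1, +1} of the L_2(P_n) distance between a*u and v."""
    return min(np.sqrt(np.mean((a * u - v) ** 2)) for a in (-1.0, 1.0))

rng = np.random.default_rng(0)
centers = np.tile([-2.0, 2.0], 200)                   # alternating component centers
X = centers + 0.5 * rng.normal(size=400)

phi_small = second_eigenvector_normalized(X[:200])
phi_large = second_eigenvector_normalized(X)
gap = distance_up_to_sign(phi_small, phi_large[:200])
labels = (phi_large > 0).astype(int)                  # threshold at 0
```

Here `gap` measures how much the eigenvector changes on the shared points when the sample is doubled, and `labels` is the resulting two-way partition.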



4 Interpretation of the Limit Partition 

Now we want to investigate whether the limit clustering partitions the data
space $\mathcal{X}$ in a desirable way. In this section it will be more
convenient to consider the normalized similarity matrix $H''_n$ instead of
$H'_n$, as it is a stochastic matrix. Hence we consider the normalized
similarity function $g(x,y) := k(x,y)/d(x)$, its empirical version
$g_n(x,y) := k(x,y)/d_n(x)$, and the integral operators

$R''_n : L_2(P_n) \to L_2(P_n)$, $R''_n f(x) = \int g_n(x,y) f(y) \, dP_n(y)$

$R : L_2(P) \to L_2(P)$, $R f(x) = \int g(x,y) f(y) \, dP(y)$.

The spectrum of $R''_n$ coincides with the spectrum of $H''_n$, and by the
one-to-one relationships between the spectra of $H''_n$ and $H'_n$ (cf.
Section 2), the convergence stated in Theorem 5 for $T'_n$ and $T$ holds
analogously for the operators $R''_n$ and $R$.

Let us take a step back and reflect on what we would like to achieve with
spectral clustering. The overall goal in clustering is to find a partition of
$\mathcal{X}$ into two (or more) disjoint sets $\mathcal{X}_1$ and
$\mathcal{X}_2$ such that the similarity between points from the same set is
high while the similarity between points from different sets is low. Assuming
that such a partition exists, what does the operator $R$ look like? Let
$\mathcal{X} = \mathcal{X}_1 \cup \mathcal{X}_2$ be a partition of the space
$\mathcal{X}$ into two disjoint, measurable sets such that
$P(\mathcal{X}_1 \cap \mathcal{X}_2) = 0$. As $\sigma$-algebra on
$\mathcal{X}_i$ we use the restrictions
$\mathcal{B}_i := \{ B \cap \mathcal{X}_i ;\ B \in \mathcal{B} \}$ of the Borel
$\sigma$-algebra $\mathcal{B}$ on $\mathcal{X}$. Define the measures $P_i$ as
the restrictions of $P$ to $\mathcal{B}_i$. Now we can identify the space
$L_2(\mathcal{X}, \mathcal{B}, P)$ with the direct sum
$L_2(\mathcal{X}_1, \mathcal{B}_1, P_1) \oplus L_2(\mathcal{X}_2, \mathcal{B}_2, P_2)$.
Each function $f \in L_2(\mathcal{X})$ corresponds to a tuple
$(f_1, f_2) \in L_2(\mathcal{X}_1) \oplus L_2(\mathcal{X}_2)$, where
$f_i : \mathcal{X}_i \to \mathbb{R}$ is the restriction of $f$ to
$\mathcal{X}_i$. The operator $R$ can be identified with the matrix
$\begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}$ acting on
$L_2(\mathcal{X}_1, \mathcal{B}_1, P_1) \oplus L_2(\mathcal{X}_2, \mathcal{B}_2, P_2)$.
We denote by $d_i$ the restriction of $d$ to $\mathcal{X}_i$ and by $g_{ij}$
the restriction of $g$ to $\mathcal{X}_i \times \mathcal{X}_j$. With these
notations, the operators $R_{ij}$ for $i, j = 1, 2$ are defined as

$R_{ij} : L_2(\mathcal{X}_j) \to L_2(\mathcal{X}_i)$, $R_{ij} f_j(x) = \int g_{ij}(x,y) f_j(y) \, dP_j(y)$.

Now assume that our space is ideally clustered, that is, the similarity
function satisfies $k(x_1, x_2) = 0$ for all $x_1 \in \mathcal{X}_1$ and
$x_2 \in \mathcal{X}_2$, and $k(x_i, x'_i) > 0$ for
$x_1, x'_1 \in \mathcal{X}_1$ or $x_2, x'_2 \in \mathcal{X}_2$. Then the
operator $R$ has the form
$\begin{pmatrix} R_{11} & 0 \\ 0 & R_{22} \end{pmatrix}$. It has eigenvalue 1
with multiplicity 2, and the corresponding eigenspace is spanned by the
vectors $(\mathbf{1}, 0)$ and $(0, \mathbf{1})$. Hence, all eigenfunctions
corresponding to eigenvalue 1 are piecewise constant on the sets
$\mathcal{X}_1, \mathcal{X}_2$, and the eigenfunction orthogonal to the
function $(\mathbf{1}, \mathbf{1})$ has opposite sign on both sets.
Thresholding the second eigenfunction will recover the true clustering
$\mathcal{X}_1 \cup \mathcal{X}_2$. When we interpret the function $g$ as a
Markov transition kernel, the operator $R$ describes a Markov diffusion
process on $\mathcal{X}$. We see that the clustering constructed by its second
eigenfunction partitions the space into two sets such that diffusion takes
place within the sets, but not between them.
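
The finite-dimensional analogue of this ideal case is easy to check numerically. In the sketch below, the block sizes and the random positive within-block similarities are arbitrary illustrative choices; the matrix R is the row-stochastic matrix obtained from a block-diagonal similarity matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 4, 3

# Similarity matrix of an ideally clustered sample: positive within blocks, zero across.
K = np.zeros((n1 + n2, n1 + n2))
K[:n1, :n1] = rng.uniform(0.5, 1.0, size=(n1, n1))
K[n1:, n1:] = rng.uniform(0.5, 1.0, size=(n2, n2))
K = (K + K.T) / 2                              # symmetrize; block structure is preserved

R = K / K.sum(axis=1, keepdims=True)           # row-stochastic matrix D^{-1} K

eigvals = np.linalg.eigvals(R).real
multiplicity_of_one = int(np.sum(np.isclose(eigvals, 1.0)))

# Indicator vectors of the two blocks, the discrete versions of (1, 0) and (0, 1).
ind1 = np.r_[np.ones(n1), np.zeros(n2)]
ind2 = np.r_[np.zeros(n1), np.ones(n2)]
fixed1 = np.allclose(R @ ind1, ind1)
fixed2 = np.allclose(R @ ind2, ind2)
```

Both indicator vectors are eigenvectors for eigenvalue 1, so every vector in the eigenspace is piecewise constant on the two blocks.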

The same reasoning also applies to the finite sample case, cf.
Meila and Shi (2001), Weiss (1999), and Ng et al. (2001). We split the finite
sample space $\{X_1, \dots, X_n\}$ into the two sets
$\mathcal{X}_{i,n} := \{X_1, \dots, X_n\} \cap \mathcal{X}_i$, and define

$R_{ij,n} : L_2(\mathcal{X}_{j,n}) \to L_2(\mathcal{X}_{i,n})$, $R_{ij,n} f_j(x) = \int g_{ij,n}(x,y) f_j(y) \, dP_{j,n}(y)$.

According to Meila and Shi (2001), spectral clustering tries to find a
partition such that the probability of staying within the same cluster is
large while the probability of going from one cluster into another one is low.
So both in the finite sample case and in the limit case a similar
interpretation applies. This shows in particular that the limit clustering
accomplishes the goal of clustering, namely to partition the space into sets
such that the within similarity is large and the between similarity is low.



In practice, the operator $R$ will usually be irreducible, i.e. there will
exist no partition such that the operators $R_{12}$ and $R_{21}$ vanish. Then
the goal will be to find a partition such that the norms of $R_{12}$ and
$R_{21}$ are as small as possible, while the norms of $R_{ii}$ should be
reasonably large. If we find such a partition, then the operators
$\begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}$ and
$\begin{pmatrix} R_{11} & 0 \\ 0 & R_{22} \end{pmatrix}$ are close in operator
norm and according to perturbation theory have a similar spectrum. Then the
partition constructed by $R$ will be approximately the same as the one
constructed by $\begin{pmatrix} R_{11} & 0 \\ 0 & R_{22} \end{pmatrix}$, which
is the partition $\mathcal{X}_1 \cup \mathcal{X}_2$.



The convergence results in Section 3 show that the first eigenspaces of
$R_n$ converge to the first eigenspaces of the limit operator $R$. This
statement can be further strengthened by proving that each of the four
operators $R_{ij,n}$ converges to its limit operator $R_{ij}$ compactly, which
can be done by methods from von Luxburg et al. As a consequence, also the
eigenvalues and eigenspaces of the single operators converge. This statement
is even sharper than the convergence statement of $R_n$ to $R$. It shows that
for any fixed partition of $\mathcal{X}$, the structure of the operator $R_n$
is preserved when taking the limit. This means that a partition that has been
constructed on the finite sample such that the diffusion between the two sets
is small also keeps this property when we take the limit.



5 Convergence of the Unnormalized Laplacian 

So far we always considered the normalized Laplacian matrix. The reason is
that this case is inherently simpler to treat than the unnormalized case. In the
unnormalized case, we have to study the operators

$U_n f(x) := \int k(x,y) (f(x) - f(y)) \, dP_n(y) = d_n(x) f(x) - \int k(x,y) f(y) \, dP_n(y)$

$U f(x) := \int k(x,y) (f(x) - f(y)) \, dP(y) = d(x) f(x) - \int k(x,y) f(y) \, dP(y)$.

It is clear that $U_n$ is the operator corresponding to the unnormalized
Laplacian $\frac{1}{n} L_n$, and $U$ is its pointwise limit operator for
$n \to \infty$. In von Luxburg et al. we show that under mild assumptions,
$U_n$ converges to $U$ compactly. Compact convergence is a type of convergence
which is a bit weaker than operator norm convergence, but still strong enough
to ensure the convergence of eigenvalues and spectral projections
(Chatelin, 1983). But there is a big problem related to the structure of the
operators $U_n$ and $U$. Both consist of a difference of two operators, a
bounded multiplication operator and a compact integral operator. This is bad
news, as multiplication operators are never compact. On the contrary, the
spectrum of a multiplication operator consists of the whole range of the
multiplier function (cf. Proposition 1.2). Hence, the spectrum of $U$ consists
of an essential spectrum which coincides with the range of the degree function,
and possibly some discrete spectrum of isolated eigenvalues (cf. Proposition 1.5).
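
The decomposition behind this argument can be written out in a few lines. The names $M_d$ and $S_k$ are introduced here only for illustration, and the step identifying the spectrum of $M_d$ with its essential spectrum assumes that $P$ has no atoms.

```latex
U = M_d - S_k, \qquad (M_d f)(x) = d(x) f(x), \qquad (S_k f)(x) = \int k(x,y) f(y)\, dP(y),
\\
\sigma_{ess}(U) = \sigma_{ess}(M_d - S_k) = \sigma_{ess}(M_d) = \operatorname{ess\,range}(d)
\quad \text{(Proposition 1.5 with } A = M_d,\ V = -S_k \text{, then Proposition 1.2)},
\\
\sigma(U) \setminus \sigma_{ess}(U) = \sigma_d(U)
\quad \text{(isolated eigenvalues of finite multiplicity, possibly empty)}.
```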

This has the consequence that although we know that $U_n$ converges to
$U$ in a strong sense, we are not able to conclude anything about the
convergence of the second eigenvectors. The reason is that perturbation theory
only allows us to state convergence results for isolated parts of the spectra.
So we get that the essential spectrum of $U_n$ converges to the essential
spectrum of $U$. Moreover, if $\sigma(U)$ has a non-empty discrete spectrum,
then we can also state convergence of the eigenvalues and eigenspaces belonging
to the discrete spectrum. But unfortunately, it is impossible to conclude
anything about the convergence of eigenvalues that lie inside the essential
spectrum of $U$. In von Luxburg et al. we actually construct an example of a
space $\mathcal{X}$ and a similarity function $k$ such that all non-zero
eigenvalues of the unnormalized Laplacian indeed lie inside the essential
spectrum of $U$. Now we have the problem that given a finite sample, we cannot
detect whether the second eigenvalue of the limit operator will lie inside or
outside the essential spectrum of $U$, and hence we cannot guarantee that the
second eigenvectors of the unnormalized Laplacian matrices converge.
Altogether this means that although we have strong convergence results for
$U_n$, without further knowledge we are not able to draw any useful conclusion
concerning the second eigenvalues.

On the other hand, in case we can guarantee the convergence of unnormalized
spectral clustering (i.e., if the second eigenvalue is not inside the essential
spectrum), then the limit partition in the unnormalized case can be interpreted
similarly to the normalized case by taking into account the form of the operator
$U$ on $L_2(\mathcal{X}_1, \mathcal{B}_1, P_1) \oplus L_2(\mathcal{X}_2, \mathcal{B}_2, P_2)$.
Similar to above, it is composed of a matrix of four operators
$(U_{ij})_{i,j=1,2}$ defined as

$U_{ii} : L_2(\mathcal{X}_i) \to L_2(\mathcal{X}_i)$, $U_{ii} f_i(x) = d_i(x) f_i(x) - \int k_{ii}(x,y) f_i(y) \, dP_i(y)$

$U_{ij} : L_2(\mathcal{X}_j) \to L_2(\mathcal{X}_i)$, $U_{ij} f_j(x) = - \int k_{ij}(x,y) f_j(y) \, dP_j(y)$ (for $i \ne j$).

We see that the off-diagonal operators $U_{ij}$ for $i \ne j$ only consist of
integral operators, whereas the multiplication operators only appear in the
diagonal operators $U_{ii}$. Thus the operators $U_{ij}$ for $i \ne j$ can also
be seen as diffusion operators, and the same interpretation as in the
normalized case is possible. If there exists a partition such that
$k(x_1, x_2) = 0$ for all $x_1 \in \mathcal{X}_1$ and $x_2 \in \mathcal{X}_2$,
then the second eigenfunction is constant on both parts, and thresholding this
eigenfunction will recover the "true" partition. Thus, also in the unnormalized
case the goal of spectral clustering is to find partitions such that the norms
of the off-diagonal operators are small and the norms of the diagonal operators
are large. This holds both in the discrete case and in the limit case, but only
if the second eigenvalue of $U$ is not inside the range of the degree function.

To summarize, from a technical point of view the eigenvectors of the unnor- 
malized Laplacian are more unpleasant to deal with than the normalized ones, as 
the limit operator has a large essential spectrum in which the interesting eigen- 
values could be contained. But if the second eigenvalue of the limit operator is 
isolated, some kind of diffusion interpretation is still possible. This means that if 
unnormalized spectral clustering converges, then it converges to a sensible limit 
clustering. 

6 Discussion 

We showed in Theorem 5 that the second eigenvector of the normalized
Laplacian matrix converges to the second eigenfunction of some limit operator
almost surely. The assumptions in this theorem are usually satisfied in practical
applications. This allows us to conclude that in the normalized case, spectral
clustering converges to some limit partition of the whole space which only
depends on the similarity function $k$ and the probability distribution $P$. We
also gave an explanation of what this partition looks like in terms of a
diffusion process on the data space. Intuitively, the limit partition
accomplishes the objective of clustering, namely to divide the space into sets
such that the similarity within the sets is large and the similarity between
the sets is low.

The methods we used to prove the convergence in the case of the normalized
Laplacian fail in the unnormalized case. The reason is that the limit operator
in the unnormalized case is not compact and has a large essential spectrum.
Convergence of the second eigenvector in the unnormalized case can be
proved with different methods using collectively compact convergence of linear
operators, but only under strong assumptions on the spectrum of the limit
operator which are not always satisfied in practice (cf. von Luxburg et al.).
However, if these assumptions are satisfied, then the limit clustering partitions
the data space in a reasonable way. In practice, the fact that the
unnormalized case seems much more difficult than the normalized case might
serve as an indication that the normalized case of spectral clustering should
be preferred.

The observations in Section 4 allow us to make some more suggestions for
the practical application of spectral clustering. According to the diffusion
interpretation, it seems possible to construct a criterion to evaluate the
goodness of the partition achieved by spectral clustering. For a good partition,
the off-diagonal operators $R_{12,n}$ and $R_{21,n}$ should have a small norm
compared to the norm of the diagonal matrices $R_{11,n}$ and $R_{22,n}$, which
is easy to check in practical applications. It will be a topic for future
investigations to work out this idea in detail.
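
One possible form of such a check is sketched below. It is only an illustration of the idea, not a criterion worked out in this paper: the use of the matrix spectral norm as a stand-in for the operator norms, the particular ratio, the toy data, and the function names `diffusion_block_norms` and `partition_score` are all assumptions made here.

```python
import numpy as np

def diffusion_block_norms(K, labels):
    """Spectral norms of the four blocks of D^{-1} K induced by a two-way partition."""
    R = K / K.sum(axis=1, keepdims=True)               # row-stochastic matrix
    idx = [np.flatnonzero(labels == c) for c in (0, 1)]
    return np.array([[np.linalg.norm(R[np.ix_(idx[i], idx[j])], 2)
                      for j in range(2)] for i in range(2)])

def partition_score(K, labels):
    """Largest off-diagonal block norm divided by the smallest diagonal block norm."""
    N = diffusion_block_norms(K, labels)
    return max(N[0, 1], N[1, 0]) / min(N[0, 0], N[1, 1])

# Two well-separated groups on the real line with a Gaussian similarity (assumed).
x = np.concatenate([np.linspace(-2.5, -1.5, 10), np.linspace(1.5, 2.5, 10)])
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)

good = np.repeat([0, 1], 10)        # partition that follows the two groups
bad = np.arange(20) % 2             # partition that cuts through both groups
score_good = partition_score(K, good)
score_bad = partition_score(K, bad)
```

A small score indicates that little diffusion takes place between the two parts relative to the diffusion within them.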

There are many open questions related to spectral clustering which have not 
been addressed in our work so far. The most obvious one is the question about 
the speed of convergence and the concentration of the limit results. Results in 
this direction would enable us to make confidence predictions about how close 
the clustering on the finite sample is to the “true” clustering proposed by the 
limit operator. 

This immediately raises a second question: Which relations are there 
between the limit clustering and the geometry of the data space? For certain 
similarity functions such as the Gaussian kernel $k_t(x,y) = \exp(-\|x - y\|^2/t)$, it 
has been established that there is a relationship between the operator T and 
the Laplace operator on $\mathbb{R}^n$ (Bousquet et al., 2004) or the Laplace-Beltrami 
operator on manifolds (Belkin, 2003). Can this relationship also be extended to 
the eigenvalues and eigenfunctions of the operators? 

There are also more technical questions related to our approach. The first 
one is the question which space of functions is the “natural” space to study 
spectral clustering. The space $L_2(P)$ is a large space and is likely to contain 
all eigenfunctions we might be interested in. On the other hand, for “nice” 
similarity functions the eigenfunctions are continuous or even differentiable, 
thus $L_2(P)$ might be too general to discuss relevant properties such as relations 
to continuous Laplace operators. Moreover, we want to use functions which are 
pointwise defined, as we are interested in the value of the function at specific 
data points. But of all spaces, the functions in $L_p$-spaces do not have this 
property. 

Another question concerns the type of convergence results we should prove. 
In this work, we fixed the similarity function k and considered the limit for 
$n \to \infty$. As a next step, the convergence of the limit operators with respect to 
some kernel parameters, such as the kernel width t for the Gaussian kernel, can 
be studied as in the works of Bousquet et al. (2004) and Belkin (2003). But it 
seems more appropriate to take limits in t and n simultaneously. This might 
reveal other important aspects of spectral clustering, for example how the kernel 
width should scale with n. 

Abstract. We consider the problem of estimating an unknown probability distribution 
from samples using the principle of maximum entropy (maxent). To 
alleviate overfitting with a very large number of features, we propose applying the 
maxent principle with relaxed constraints on the expectations of the features. By 
convex duality, this turns out to be equivalent to finding the Gibbs distribution min- 
imizing a regularized version of the empirical log loss. We prove non-asymptotic 
bounds showing that, with respect to the true underlying distribution, this relaxed 
version of maxent produces density estimates that are almost as good as the best 
possible. These bounds are in terms of the deviation of the feature empirical av- 
erages relative to their true expectations, a number that can be bounded using 
standard uniform-convergence techniques. In particular, this leads to bounds that 
drop quickly with the number of samples, and that depend very moderately on the 
number or complexity of the features. We also derive and prove convergence for 
both sequential-update and parallel-update algorithms. Finally, we briefly describe 
experiments on data relevant to the modeling of species geographical distributions. 



1 Introduction 

The maximum entropy (maxent) approach to probability density estimation was first 
proposed by Jaynes [9] in 1957, and has since been used in many areas of computer 
science and statistical learning, especially natural language processing [1,6]. In maxent, 
one is given a set of samples from a target distribution over some space, and a set of known 
constraints on the distribution. The distribution is then estimated by a distribution of 
maximum entropy satisfying the given constraints. The constraints are often represented 
using a set of features (real-valued functions) on the space, with the expectation of every 
feature being required to match its empirical average. By convex duality, this turns out 
to be the unique Gibbs distribution maximizing the likelihood of the samples, where 
a Gibbs distribution is one that is exponential in a linear combination of the features. 
(Maxent and its dual are described more rigorously in Section 2.) 

The work in this paper was motivated by a new application of maxent to the problem 
of modeling the distribution of a plant or animal species, a critical problem in conser- 
vation biology. This application is explored in detail in a companion paper [13]. Input 
data for species distribution modeling consists of occurrence locations of a particular 
species in a certain region and of environmental variables for that region. Environmental 
variables may include topological layers, such as elevation and aspect, meteorological 
layers, such as annual precipitation and average temperature, as well as categorical lay- 
ers, such as vegetation and soil types. Occurrence locations are commonly derived from 
specimen collections in natural history museums and herbaria. In the context of maxent, 
the sample space is a map divided into a finite number of cells, the modeled distribution 
is the probability that a random specimen of the species occurs in a given cell, samples 
are occurrence records, and features are environmental variables or functions thereof. 

It should not be surprising that maxent can severely overfit training data when the 
constraints on the output distribution are based on feature expectations, as described 
above, especially if there is a very large number of features. For instance, in our ap- 
plication, we sometimes consider threshold features for each environmental variable. 
These are binary features equal to one if an environmental variable is larger than a fixed 
threshold and zero otherwise. Thus, there is a continuum of features for each variable, 
and together they force the output distribution to be non-zero only at values achieved by 
the samples. The problem is that in general, the empirical averages of the features will 
almost never be equal to their true expectation, so that the target distribution itself does 
not satisfy the constraints imposed on the output distribution. On the other hand, we do 
expect that empirical averages will be close to their expectations. In addition, we often 
have bounds or estimates on deviations of empirical feature averages from their expec- 
tations (empirical error bounds). In this paper, we propose a relaxation of feature-based 
maxent constraints in which we seek the distribution of maximum entropy subject to the 
constraint that feature expectations be within empirical error bounds of their empirical 
averages (rather than exactly equal to them). 

As was the case for the standard feature-based maxent, the convex dual of this 
relaxed problem has a natural interpretation. In particular, this problem turns out to 
be equivalent to minimizing the empirical log loss of the sample points plus an $\ell_1$-style 
regularization term. As we demonstrate, this form of regularization has numerous 
advantages, enabling the proof of meaningful bounds on the deviation between the 
density estimate and the true underlying distribution, as well as the derivation of simple 
algorithms for provably minimizing this regularized loss. Beginning with the former, we 
prove that the regularized (empirical) loss function itself gives an upper bound on the 
log loss with respect to the target distribution. This provides another sensible motivation 
for minimizing this function. More specifically, we prove a guarantee on the log loss 
over the target distribution in terms of empirical error bounds on features. Thus, to get 
exact bounds, it suffices to bound the empirical errors. For finite sets of features, we can 
use Chernoff bounds with a simple union bound; for infinite sets, we can choose from an 
array of uniform-convergence techniques. For instance, for a set of binary features with 
VC-dimension d, if given m samples, the log loss of the relaxed maxent solution on the 
target distribution will be worse by no more than $O(\|\lambda^*\|_1 \sqrt{d \ln(m^2/d)/m})$ compared 
to the log loss of any Gibbs distribution defined by weight vector $\lambda^*$ with $\ell_1$-norm 
$\|\lambda^*\|_1$. For a finite set of bounded, but not necessarily binary features, this difference is 
at most $O(\|\lambda^*\|_1 \sqrt{(\ln n)/m})$ where n is the number of features. Thus, for a moderate 
number of samples, our method generates a density estimate that is almost as good as the 
best possible, and the difference can be bounded non-asymptotically. Moreover, these 







M. Dudik, S.J. Phillips, and R.E. Schapire 



bounds are very moderate in terms of the number or complexity of the features, even 
admitting an extremely large number of features from a class of bounded VC-dimension. 

Previous work on maxent regularization justified modified loss functions as either 
constraint relaxations [2,10], or priors over Gibbs distributions [2,8]. Our regularized 
loss also admits these two interpretations. As a relaxed maxent, it has been studied by 
Kazama and Tsujii [10] and as a Laplace prior by Goodman [8]. These two works give 
experimental evidence showing benefits of $\ell_1$-style regularization (Laplace prior) over 
$\ell_2^2$-style regularization (Gaussian prior), but they do not provide any theoretical 
guarantees. In the context of neural nets, Laplace priors have been studied by Williams [20]. 
A smoothened version of $\ell_1$-style regularization has been used by Dekel, Shalev-Shwartz 
and Singer [5]. 

Standard maxent algorithms such as iterative scaling [4,6], gradient descent, Newton 
and quasi-Newton methods [11,16] and their regularized versions [2,8,10,20] perform 
a sequence of feature weight updates until convergence. In each step, they update all 
feature weights. This is impractical when the number of features is very large. Instead, 
we propose a sequential update algorithm that updates only one feature weight in each 
iteration, along the lines of algorithms studied by Collins, Schapire and Singer [3]. This 
leads to a boosting-like approach permitting the selection of the best feature from a very 
large class. For instance, the best threshold feature associated with a single variable can be 
found in a single linear pass through the (pre-sorted) data, even though conceptually we 
are selecting from an infinite class of features. In Section 4, we describe our sequential- 
update algorithm and give a proof of convergence. Other boosting-like approaches to 
density estimation have been proposed by Welling, Zemel and Hinton [19] and Rosset 
and Segal [15]. 

For cases when the number of features is relatively small, yet we want to prevent 
overfitting on small sample sets, it might be more efficient to minimize the regularized 
log loss by parallel updates. In Section 5, we give the parallel-update version of our 
algorithm with a proof of convergence. 

In the last section, we return to our application to species distribution modeling. 
We present learning curves for relaxed maxent for four species of birds with a varying 
number of occurrence records. We also explore the effects of regularization on the log 
loss over the test data. A more comprehensive set of experiments is evaluated in the 
companion paper [13]. 



2 Maximum Entropy with Relaxed Constraints 

Our goal is to estimate an unknown probability distribution $\pi$ over a sample space X 
which, for the purposes of this paper, we assume to be finite. We are given a set of 
samples $x_1, \ldots, x_m$ drawn independently at random according to $\pi$. The corresponding 
empirical distribution is denoted by $\tilde{\pi}$: 

\[ \tilde{\pi}(x) = \frac{1}{m} \bigl| \{ 1 \le i \le m : x_i = x \} \bigr|. \]

We also are given a set of features $f_1, \ldots, f_n$ where $f_j : X \to \mathbb{R}$. The vector of all n 
features is denoted by $f$. For a distribution $\pi$ and function $f$, we write $\pi[f]$ to denote the 
expected value of $f$ under distribution $\pi$ (and sometimes use this notation even when $\pi$ 
is not necessarily a probability distribution): 

\[ \pi[f] = \sum_{x \in X} \pi(x) f(x). \]

In general, $\tilde{\pi}$ may be quite distant, under any reasonable measure, from $\pi$. On the 
other hand, for a given function $f$, we do expect $\tilde{\pi}[f]$, the empirical average of $f$, to 
be rather close to its true expectation $\pi[f]$. It is quite natural, therefore, to seek an 
approximation p under which $f_j$’s expectation is equal to $\tilde{\pi}[f_j]$ for every $f_j$. There 
will typically be many distributions satisfying these constraints. The maximum entropy 
principle suggests that, from among all distributions satisfying these constraints, we 
choose the one of maximum entropy, i.e., the one that is closest to uniform. Here, as 
usual, the entropy of a distribution p on X is defined to be $H(p) = -\sum_{x \in X} p(x) \ln p(x)$. 

Alternatively, we can consider all Gibbs distributions of the form 

\[ q_\lambda(x) = \frac{e^{\lambda \cdot f(x)}}{Z_\lambda} \]

where $Z_\lambda = \sum_{x \in X} e^{\lambda \cdot f(x)}$ is a normalizing constant, and $\lambda \in \mathbb{R}^n$. Then it can be 
proved [6] that the maxent distribution described above is the same as the maximum 
likelihood Gibbs distribution, i.e., the distribution $q_\lambda$ that maximizes $\prod_{i=1}^{m} q_\lambda(x_i)$, or 
equivalently, minimizes the empirical log loss (negative normalized log likelihood) 

\[ L_{\tilde{\pi}}(\lambda) = -\frac{1}{m} \sum_{i=1}^{m} \ln q_\lambda(x_i) = -\tilde{\pi}[\ln q_\lambda]. \tag{1} \]

A related measure is the relative entropy (or Kullback-Leibler divergence), defined as 

\[ \mathrm{RE}(\tilde{\pi} \,\|\, q_\lambda) = \tilde{\pi}[\ln(\tilde{\pi}/q_\lambda)]. \]

The log loss and the relative entropy differ only by the constant $H(\tilde{\pi})$. We will use the 
two interchangeably as objective functions. 
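The definitions above can be checked numerically on a toy problem. The sketch below is illustrative only: the domain, features, samples and weights are made-up values, and the helper names (`gibbs`, `empirical`, `log_loss`, and so on) are our own, not part of the paper. It evaluates the Gibbs distribution, the empirical log loss of Eq. (1), the equivalent form given later as Eq. (5), and the gap between log loss and relative entropy.

```python
import math

def gibbs(lam, features, domain):
    """Gibbs distribution q_lam(x) = exp(lam . f(x)) / Z_lam over a finite domain."""
    weights = [math.exp(sum(l * f(x) for l, f in zip(lam, features))) for x in domain]
    z = sum(weights)
    return [w / z for w in weights], z

def empirical(samples, domain):
    """Empirical distribution: fraction of samples equal to each domain point."""
    return [samples.count(x) / len(samples) for x in domain]

def log_loss(dist, q):
    """L_dist(lam) = -dist[ln q_lam]; points with zero mass contribute nothing."""
    return -sum(p * math.log(qx) for p, qx in zip(dist, q) if p > 0)

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)

def relative_entropy(dist, q):
    return sum(p * math.log(p / qx) for p, qx in zip(dist, q) if p > 0)

# Toy data (all values hypothetical).
domain = [0, 1, 2, 3]
features = [lambda x: x / 3.0, lambda x: 1.0 if x >= 2 else 0.0]
samples = [0, 1, 1, 2, 3, 3, 3, 1]
lam = [0.5, -0.3]

pi_tilde = empirical(samples, domain)
q, z = gibbs(lam, features, domain)
loss = log_loss(pi_tilde, q)

# Eq. (5): the same loss written as -lam . pi_tilde[f] + ln Z_lam.
feature_avgs = [sum(p * f(x) for p, x in zip(pi_tilde, domain)) for f in features]
loss_eq5 = -sum(l * a for l, a in zip(lam, feature_avgs)) + math.log(z)

# Log loss minus relative entropy should equal the entropy of pi_tilde.
gap = loss - relative_entropy(pi_tilde, q)
```

Because the gap does not depend on the weights, minimizing either quantity over the weight vector selects the same Gibbs distribution.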

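Thus, the convex programs corresponding to the two optimization problems are 

\[ \mathcal{P}: \ \max_{p \in \Delta} H(p) \ \text{ subject to } \ p[f_j] = \tilde{\pi}[f_j] \qquad\qquad \mathcal{Q}: \ \min_{\lambda \in \mathbb{R}^n} L_{\tilde{\pi}}(\lambda) \]

where $\Delta$ is the simplex of probability distributions over X. 

This basic approach computes the maximum entropy distribution p for which $p[f_j] = \tilde{\pi}[f_j]$. 
However, we do not expect $\tilde{\pi}[f_j]$ to be equal to $\pi[f_j]$ but only close to it. Therefore, 
in keeping with the motivation above, we can soften these constraints to have the form 

\[ |p[f_j] - \tilde{\pi}[f_j]| \le \beta_j \tag{2} \]

where $\beta_j$ is an estimated upper bound of how close $\tilde{\pi}[f_j]$, being an empirical average, 
must be to its true expectation $\pi[f_j]$. Thus, the problem can be stated as follows: 

\[ \max_{p \in \Delta} H(p) \ \text{ subject to } \ \forall j: \ |p[f_j] - \tilde{\pi}[f_j]| \le \beta_j. \]

This corresponds to the convex program: 

\[
\begin{aligned}
\mathcal{P}': \quad \max_{p \in (\mathbb{R}^+)^X} H(p) \quad \text{subject to} \quad
& \sum_{x \in X} p(x) = 1 && (\lambda_0) \\
& \forall j: \ \tilde{\pi}[f_j] - p[f_j] \le \beta_j && (\lambda_j^+) \\
& \forall j: \ p[f_j] - \tilde{\pi}[f_j] \le \beta_j && (\lambda_j^-)
\end{aligned}
\]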

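To compute the convex dual, we form the Lagrangian (dual variables are indicated next 
to constraints) to obtain the dual program 

\[
\min_{\substack{\lambda_0 \in \mathbb{R} \\ \lambda_j^+, \lambda_j^- \in \mathbb{R}^+}} \ \max_{p \in (\mathbb{R}^+)^X}
\Bigl[ H(p) - \lambda_0 \Bigl( \sum_{x \in X} p(x) - 1 \Bigr)
+ \sum_j (\lambda_j^+ - \lambda_j^-) \bigl( p[f_j] - \tilde{\pi}[f_j] \bigr)
+ \sum_j \beta_j (\lambda_j^+ + \lambda_j^-) \Bigr].
\]

Note that we have retained use of the notation $p[f]$ and $H(p)$, with the natural definitions, 
even though p is no longer necessarily a probability distribution. Without loss 
of generality we may assume that in the solution, at most one in each pair $\lambda_j^+, \lambda_j^-$ is 
nonzero. Otherwise, we could decrease them both by a positive value, decreasing the 
value of the third sum while not affecting the remainder of the expression. Thus, if we 
set $\lambda_j = \lambda_j^+ - \lambda_j^-$ then we obtain a simpler program 

\[
\min_{\lambda_0, \lambda_j \in \mathbb{R}} \ \max_{p \in (\mathbb{R}^+)^X}
\Bigl[ H(p) - \lambda_0 \Bigl( \sum_{x \in X} p(x) - 1 \Bigr)
+ \sum_j \lambda_j \bigl( p[f_j] - \tilde{\pi}[f_j] \bigr)
+ \sum_j \beta_j |\lambda_j| \Bigr].
\]

The inner expression is differentiable and concave in $p(x)$. Setting partial derivatives with 
respect to $p(x)$ equal to zero yields that p must be a Gibbs distribution with parameters 
corresponding to dual variables $\lambda_j$ and $\ln Z_\lambda = \lambda_0 + 1$. Hence the program becomes 

\[
\min_{\lambda \in \mathbb{R}^n} \Bigl[ H(q_\lambda) + \lambda \cdot \bigl( q_\lambda[f] - \tilde{\pi}[f] \bigr) + \sum_j \beta_j |\lambda_j| \Bigr]. \tag{3}
\]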
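The stationarity step that leads to Eq. (3) can be spelled out as follows. This is a sketch of the routine calculus under the definitions of this section, not text from the paper; it uses $\partial H(p)/\partial p(x) = -\ln p(x) - 1$ and $\partial p[f_j]/\partial p(x) = f_j(x)$.

```latex
\begin{align*}
\frac{\partial}{\partial p(x)} \Bigl[ H(p) - \lambda_0 \Bigl( \sum_{x' \in X} p(x') - 1 \Bigr)
  + \sum_j \lambda_j \bigl( p[f_j] - \tilde{\pi}[f_j] \bigr) \Bigr]
  &= -\ln p(x) - 1 - \lambda_0 + \lambda \cdot f(x) = 0 \\
\Longrightarrow \quad p(x) &= \frac{e^{\lambda \cdot f(x)}}{e^{\lambda_0 + 1}} .
\end{align*}
```

Comparing with the definition of a Gibbs distribution, p coincides with the Gibbs distribution for the weights $\lambda$ exactly when $e^{\lambda_0 + 1}$ equals the normalizing constant, i.e. when $\ln Z_\lambda = \lambda_0 + 1$.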



Note that 

\[ H(q_\lambda) = -q_\lambda[\ln q_\lambda] = -q_\lambda[\lambda \cdot f - \ln Z_\lambda] = -\lambda \cdot q_\lambda[f] + \ln Z_\lambda. \]

Hence, the inner expression of Eq. (3) becomes 

\[ -\lambda \cdot \tilde{\pi}[f] + \ln Z_\lambda + \sum_j \beta_j |\lambda_j| = L_{\tilde{\pi}}(\lambda) + \sum_j \beta_j |\lambda_j|. \tag{4} \]

(See Eq. (5) below.) Denoting this function by $L^\beta_{\tilde{\pi}}(\lambda)$, we obtain the final version of the 
dual program 

\[ \mathcal{Q}': \ \min_{\lambda} L^\beta_{\tilde{\pi}}(\lambda). \]

Thus, we have shown that maxent with relaxed constraints is equivalent to minimizing 
$L^\beta_{\tilde{\pi}}(\lambda)$. This modified objective function consists of an empirical loss term $L_{\tilde{\pi}}(\lambda)$ plus 
an additional term $\sum_j \beta_j |\lambda_j|$ that can be interpreted as a form of regularization limiting 
how large the weights $\lambda_j$ can become. 



3 Bounding the Loss on the Target Distribution 

In this section, we derive bounds on the performance of relaxed maxent relative to the 
true distribution $\pi$. That is, we are able to bound $L_\pi(\hat{\lambda})$ in terms of $L_\pi(\lambda^*)$ when $\hat{\lambda}$ 
minimizes the regularized loss and $q_{\lambda^*}$ is an arbitrary Gibbs distribution, in particular, the 
Gibbs distribution minimizing the true loss. Note that $\mathrm{RE}(\pi \,\|\, q_\lambda)$ differs from $L_\pi(\lambda)$ 
only by the constant term $H(\pi)$, so analogous bounds also hold for $\mathrm{RE}(\pi \,\|\, q_{\hat{\lambda}})$. 

We begin with the following simple lemma on which all of the bounds in this section 
are based. The lemma states that the difference between the true and empirical loss of 
any Gibbs distribution can be bounded in terms of the magnitude of the weights $\lambda_j$ and 
the deviation of feature averages from their means. 

Lemma 1. Let $q_\lambda$ be a Gibbs distribution. Then 

\[ |L_{\tilde{\pi}}(\lambda) - L_\pi(\lambda)| \le \sum_{j=1}^{n} |\lambda_j| \, |\tilde{\pi}[f_j] - \pi[f_j]|. \]

Proof. Note that 

\[ L_{\tilde{\pi}}(\lambda) = -\tilde{\pi}[\ln q_\lambda] = -\tilde{\pi}[\lambda \cdot f - \ln Z_\lambda] = -\lambda \cdot \tilde{\pi}[f] + \ln Z_\lambda. \tag{5} \]

Using an analogous identity for $L_\pi(\lambda)$, we obtain 

\[
|L_{\tilde{\pi}}(\lambda) - L_\pi(\lambda)| = |-\lambda \cdot \tilde{\pi}[f] + \ln Z_\lambda + \lambda \cdot \pi[f] - \ln Z_\lambda|
= |\lambda \cdot (\pi[f] - \tilde{\pi}[f])| \le \sum_{j=1}^{n} |\lambda_j| \, |\tilde{\pi}[f_j] - \pi[f_j]|. \qquad \square
\]

This lemma yields an alternative motivation for minimizing $L^\beta_{\tilde{\pi}}$. For if we have 
bounds $|\tilde{\pi}[f_j] - \pi[f_j]| \le \beta_j$, then the lemma implies that $L_\pi(\lambda) \le L^\beta_{\tilde{\pi}}(\lambda)$. Thus, in 
minimizing $L^\beta_{\tilde{\pi}}(\lambda)$, we also minimize an upper bound on $L_\pi(\lambda)$, the true log loss of $\lambda$. 
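A quick numerical check of Lemma 1 may help make the bound concrete. In the sketch below, the "true" distribution, the empirical distribution, the feature table and the weights are all hypothetical values chosen for illustration, and the regularization parameters are set to the actual feature deviations, which is the smallest choice satisfying the lemma's premise.

```python
import math

def gibbs(lam, feats):
    """Gibbs distribution on a finite domain; feats[j][x] is feature j at point x."""
    w = [math.exp(sum(l * fj[x] for l, fj in zip(lam, feats)))
         for x in range(len(feats[0]))]
    z = sum(w)
    return [v / z for v in w]

def log_loss(dist, q):
    """L_dist(lam) = -dist[ln q_lam]."""
    return -sum(p * math.log(qx) for p, qx in zip(dist, q) if p > 0)

def mean(dist, fj):
    """Expectation dist[f_j]."""
    return sum(p * v for p, v in zip(dist, fj))

# Hypothetical toy setup on a four-point domain.
feats = [[0.0, 0.2, 0.9, 1.0], [1.0, 0.0, 0.0, 1.0]]
pi = [0.1, 0.4, 0.3, 0.2]          # stands in for the unknown true distribution
pi_tilde = [0.2, 0.3, 0.3, 0.2]    # stands in for an empirical distribution
lam = [1.5, 0.7]

q = gibbs(lam, feats)
deviations = [abs(mean(pi_tilde, fj) - mean(pi, fj)) for fj in feats]

# Lemma 1: |L_emp - L_true| <= sum_j |lam_j| * |deviation_j|.
lhs = abs(log_loss(pi_tilde, q) - log_loss(pi, q))
rhs = sum(abs(l) * d for l, d in zip(lam, deviations))

# With beta_j equal to the deviations, the regularized empirical loss
# upper-bounds the true loss.
true_loss = log_loss(pi, q)
regularized_loss = log_loss(pi_tilde, q) + rhs
```

In practice the deviations are unknown, which is why the corollaries that follow replace them with high-probability bounds.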

Next, we prove that the distribution produced using maxent cannot be much worse 
than the best Gibbs distribution (with bounded weight vector), assuming the empirical 
errors of the features are not too large. 

Theorem 1. Assume that for each j, $|\pi[f_j] - \tilde{\pi}[f_j]| \le \beta_j$. Let $\hat{\lambda}$ minimize the regularized 
log loss $L^\beta_{\tilde{\pi}}(\lambda)$. Then for an arbitrary Gibbs distribution $q_{\lambda^*}$ 

\[ L_\pi(\hat{\lambda}) \le L_\pi(\lambda^*) + 2 \sum_{j=1}^{n} \beta_j |\lambda^*_j|. \]

Proof. 

\begin{align}
L_\pi(\hat{\lambda}) &\le L_{\tilde{\pi}}(\hat{\lambda}) + \sum_j \beta_j |\hat{\lambda}_j| = L^\beta_{\tilde{\pi}}(\hat{\lambda}) \tag{6} \\
&\le L^\beta_{\tilde{\pi}}(\lambda^*) = L_{\tilde{\pi}}(\lambda^*) + \sum_j \beta_j |\lambda^*_j| \tag{7} \\
&\le L_\pi(\lambda^*) + 2 \sum_j \beta_j |\lambda^*_j|. \tag{8}
\end{align}

Eqs. (6) and (8) follow from Lemma 1, Eq. (7) follows from the optimality of $\hat{\lambda}$. □ 

Thus, if we can bound $|\pi[f_j] - \tilde{\pi}[f_j]|$, then we can use Theorem 1 to obtain a bound on 
the true loss $L_\pi(\hat{\lambda})$. Fortunately, this is just a matter of bounding the difference between 
an empirical average and its expectation, a problem for which there exists a huge array 
of techniques. For instance, when the features are bounded, we can prove the following: 

Corollary 1. Assume that features $f_1, \ldots, f_n$ are bounded in [0, 1]. Let $\delta > 0$ and let 
$\hat{\lambda}$ minimize $L^\beta_{\tilde{\pi}}(\lambda)$ with $\beta_j = \beta = \sqrt{\ln(2n/\delta)/(2m)}$ for all j. Then with probability at 
least $1 - \delta$, for every Gibbs distribution $q_{\lambda^*}$, 

\[ L_\pi(\hat{\lambda}) \le L_\pi(\lambda^*) + 2 \|\lambda^*\|_1 \sqrt{\frac{\ln(2n/\delta)}{2m}}. \]

Proof. By Hoeffding’s inequality, for a fixed j, the probability that $|\pi[f_j] - \tilde{\pi}[f_j]|$ exceeds 
$\beta$ is at most $2e^{-2\beta^2 m} = \delta/n$. By the union bound, the probability of this happening for 
any j is at most $\delta$. The corollary now follows immediately from Theorem 1. □ 
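The choice of the regularization parameter in Corollary 1 is easy to evaluate numerically. The following sketch uses hypothetical values for the number of features, the number of samples and the confidence parameter; the function names are our own. It computes the parameter, confirms that the per-feature Hoeffding failure probability equals the confidence parameter divided by the number of features, and evaluates the resulting excess-loss bound.

```python
import math

def hoeffding_beta(n_features, m_samples, delta):
    """beta = sqrt(ln(2n/delta) / (2m)), the choice used in Corollary 1."""
    return math.sqrt(math.log(2 * n_features / delta) / (2 * m_samples))

def excess_loss_bound(l1_norm, n_features, m_samples, delta):
    """Excess log loss over a competitor with the given l1-norm: 2 * ||lam*||_1 * beta."""
    return 2 * l1_norm * hoeffding_beta(n_features, m_samples, delta)

# Hypothetical problem size.
n, m, delta = 1000, 500, 0.05
beta = hoeffding_beta(n, m, delta)

# Hoeffding: P(|deviation| > beta) <= 2 exp(-2 beta^2 m) for a single feature.
per_feature_failure = 2 * math.exp(-2 * beta ** 2 * m)

bound_m = excess_loss_bound(1.0, n, m, delta)
bound_4m = excess_loss_bound(1.0, n, 4 * m, delta)   # four times as many samples
```

Quadrupling the number of samples halves the bound, while the number of features enters only through a logarithm.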

Similarly, when the $f_j$’s are selected from a possibly larger class of binary features 
with VC-dimension d, we can prove the following corollary. This will be the case, 
for instance, when using threshold features on k variables, a class with VC-dimension 
$O(\ln k)$. 

Corollary 2. Assume that features are binary with VC-dimension d. Let $\delta > 0$ and let $\hat{\lambda}$ 
minimize $L^\beta_{\tilde{\pi}}(\lambda)$ with $\beta_j = \beta = \sqrt{[d \ln(em^2/d) + \ln(1/\delta) + \ln(4e^8)]/(2m)}$ for all j. 
Then with probability at least $1 - \delta$, for every Gibbs distribution $q_{\lambda^*}$, 

\[ L_\pi(\hat{\lambda}) \le L_\pi(\lambda^*) + 2 \|\lambda^*\|_1 \sqrt{\frac{d \ln(em^2/d) + \ln(1/\delta) + \ln(4e^8)}{2m}}. \]

Proof. In this case, a uniform-convergence result of Devroye [7], combined with Sauer’s 
Lemma, can be used to argue that $|\pi[f_j] - \tilde{\pi}[f_j]| \le \beta$ for all $f_j$ simultaneously, with 
probability at least $1 - \delta$. □ 

As noted in the introduction, these corollaries show that the difference in performance 
between the density estimate computed by minimizing $L^\beta_{\tilde{\pi}}$ and the best Gibbs 
distribution (of bounded norm), becomes small rapidly as the number of samples m 
increases. Moreover, the dependence of this difference on the number or complexity of 
the features is quite moderate. 



4 A Sequential-Update Algorithm and Convergence Proof 

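There are a number of algorithms for finding the maxent distribution, especially iterative 
scaling and its variants [4,6]. In this section, we describe and prove the convergence 
of a sequential-update algorithm that modifies one weight $\lambda_j$ at a time, as explored 
by Collins, Schapire and Singer [3] in a similar setting. This style of coordinate-wise 
descent is convenient when working with a very large (or infinite) number of features. 

As explained in Section 2, the goal of the algorithm is to find $\lambda$ minimizing the 
objective function $L^\beta_{\tilde{\pi}}(\lambda)$ given in Eq. (4). Our algorithm works by iteratively adjusting 
the single weight $\lambda_j$ that will maximize (an approximation of) the decrease in $L^\beta_{\tilde{\pi}}$. To be 
more precise, suppose we add $\delta$ to $\lambda_j$. Let $\lambda'$ be the resulting vector of weights, identical 
to $\lambda$ except that $\lambda'_j = \lambda_j + \delta$. Then the change in $L^\beta_{\tilde{\pi}}$ is 

\begin{align}
L^\beta_{\tilde{\pi}}(\lambda') - L^\beta_{\tilde{\pi}}(\lambda)
&= \lambda \cdot \tilde{\pi}[f] - \lambda' \cdot \tilde{\pi}[f] + \ln Z_{\lambda'} - \ln Z_\lambda + \beta_j (|\lambda'_j| - |\lambda_j|) \tag{9} \\
&= -\delta \tilde{\pi}[f_j] + \ln \bigl( q_\lambda[e^{\delta f_j}] \bigr) + \beta_j (|\lambda_j + \delta| - |\lambda_j|) \tag{10} \\
&\le -\delta \tilde{\pi}[f_j] + \ln \bigl( q_\lambda[1 + (e^\delta - 1) f_j] \bigr) + \beta_j (|\lambda_j + \delta| - |\lambda_j|) \tag{11} \\
&= -\delta \tilde{\pi}[f_j] + \ln \bigl( 1 + (e^\delta - 1) q_\lambda[f_j] \bigr) + \beta_j (|\lambda_j + \delta| - |\lambda_j|). \tag{12}
\end{align}

Eq. (9) follows from Eq. (5). Eq. (10) uses 

\[ Z_{\lambda'} = \sum_{x \in X} e^{\lambda \cdot f(x) + \delta f_j(x)} = Z_\lambda \sum_{x \in X} q_\lambda(x) e^{\delta f_j(x)}. \tag{13} \]

Eq. (11) is because $e^{\delta x} \le 1 + (e^\delta - 1)x$ for $x \in [0, 1]$. 

Input: Finite domain X 
       features $f_1, \ldots, f_n$ where $f_j : X \to [0, 1]$ 
       examples $x_1, \ldots, x_m \in X$ 
       nonnegative regularization parameters $\beta_1, \ldots, \beta_n$ 
Output: $\lambda_1, \lambda_2, \ldots$ minimizing $L^\beta_{\tilde{\pi}}(\lambda)$ 

Let $\lambda_1 = 0$ 
For $t = 1, 2, \ldots$: 
  - let $(j, \delta) = \arg\min_{(j, \delta)} F_j(\lambda_t, \delta)$ 
    where $F_j(\lambda, \delta)$ is the expression appearing in Eq. (12) 
  - $\lambda_{t+1,j'} = \lambda_{t,j'} + \delta$ if $j' = j$, and $\lambda_{t+1,j'} = \lambda_{t,j'}$ otherwise 

Fig. 1. A sequential-update algorithm for optimizing the regularized log loss. 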

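Let $F_j(\lambda, \delta)$ denote the expression in Eq. (12). This function can be minimized over 
all choices of $\delta \in \mathbb{R}$ via a simple case analysis on the sign of $\lambda_j + \delta$. In particular, using 
calculus, we see that we only need consider the possibility that $\delta = -\lambda_j$ or that $\delta$ is 
equal to 

\[
\ln \left( \frac{(\tilde{\pi}[f_j] - \beta_j)(1 - q_\lambda[f_j])}{(1 - \tilde{\pi}[f_j] + \beta_j) \, q_\lambda[f_j]} \right)
\quad \text{or} \quad
\ln \left( \frac{(\tilde{\pi}[f_j] + \beta_j)(1 - q_\lambda[f_j])}{(1 - \tilde{\pi}[f_j] - \beta_j) \, q_\lambda[f_j]} \right)
\]

where the first and second of these can be valid only if $\lambda_j + \delta \ge 0$ and $\lambda_j + \delta \le 0$, 
respectively. 

This case analysis is repeated for all features $f_j$. The pair $(j, \delta)$ minimizing $F_j(\lambda, \delta)$ 
is then selected and $\delta$ is added to $\lambda_j$. The complete algorithm is shown in Figure 1. 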
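The sequential update can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the authors' implementation: features are stored as a small dense table over a finite domain, the loop runs for a fixed number of iterations instead of testing convergence, the zero step is added to the candidate set as a numerical safeguard, and all data values and function names are hypothetical.

```python
import math

def gibbs(lam, feats):
    """Gibbs distribution on a finite domain; feats[j][x] is feature j at point x."""
    w = [math.exp(sum(l * fj[x] for l, fj in zip(lam, feats)))
         for x in range(len(feats[0]))]
    z = sum(w)
    return [v / z for v in w], z

def reg_loss(lam, feats, emp_avg, beta):
    """Regularized log loss: -lam . emp_avg + ln Z + sum_j beta_j |lam_j|."""
    _, z = gibbs(lam, feats)
    return (-sum(l * a for l, a in zip(lam, emp_avg)) + math.log(z)
            + sum(b * abs(l) for b, l in zip(beta, lam)))

def bound_f(delta, lam_j, emp, q_avg, beta_j):
    """Upper bound of Eq. (12) on the change in regularized loss."""
    return (-delta * emp + math.log(1 + (math.exp(delta) - 1) * q_avg)
            + beta_j * (abs(lam_j + delta) - abs(lam_j)))

def best_delta(lam_j, emp, q_avg, beta_j):
    """Minimize bound_f over the candidate steps from the case analysis."""
    candidates = [0.0, -lam_j]  # 0.0 is a safeguard added here, not in the paper
    if 0 < q_avg < 1:
        branches = [(emp - beta_j, 1 - emp + beta_j, +1),  # needs lam_j + delta >= 0
                    (emp + beta_j, 1 - emp - beta_j, -1)]  # needs lam_j + delta <= 0
        for num, den, sign in branches:
            if num > 0 and den > 0:
                d = math.log(num * (1 - q_avg) / (den * q_avg))
                if sign * (lam_j + d) >= 0:
                    candidates.append(d)
    return min(candidates, key=lambda d: bound_f(d, lam_j, emp, q_avg, beta_j))

def sequential_update(feats, emp_avg, beta, iterations=100):
    """Update one weight per iteration: the one with the best bound."""
    lam = [0.0] * len(feats)
    for _ in range(iterations):
        q, _ = gibbs(lam, feats)
        q_avg = [sum(p * v for p, v in zip(q, fj)) for fj in feats]
        steps = [best_delta(lam[j], emp_avg[j], q_avg[j], beta[j])
                 for j in range(len(feats))]
        j = min(range(len(feats)),
                key=lambda k: bound_f(steps[k], lam[k], emp_avg[k], q_avg[k], beta[k]))
        lam[j] += steps[j]
    return lam

# Hypothetical toy problem: two features with values in [0, 1] on four points.
feats = [[0.0, 0.2, 0.9, 1.0], [1.0, 0.0, 0.0, 1.0]]
pi_tilde = [0.2, 0.3, 0.3, 0.2]
emp_avg = [sum(p * v for p, v in zip(pi_tilde, fj)) for fj in feats]
beta = [0.05, 0.05]

lam_hat = sequential_update(feats, emp_avg, beta)
q_hat, _ = gibbs(lam_hat, feats)
q_hat_avg = [sum(p * v for p, v in zip(q_hat, fj)) for fj in feats]
```

On this toy problem the fitted feature expectations `q_hat_avg` end up within the tolerance of the empirical averages, which is the relaxed constraint of Eq. (2). For a very large feature class one would replace the inner loop over all features with a search procedure specific to that class.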

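The following theorem shows that this algorithm is guaranteed to produce a sequence 
of $\lambda_t$’s minimizing the objective function $L^\beta_{\tilde{\pi}}$ in the case of interest where all the $\beta_j$’s 
are positive. A modified proof can be used in the unregularized case in which all the $\beta_j$’s 
are zero. 

Theorem 2. Assume all the $\beta_j$’s are strictly positive. Then the algorithm of Figure 1 
produces a sequence $\lambda_1, \lambda_2, \ldots$ for which 

\[ \lim_{t \to \infty} L^\beta_{\tilde{\pi}}(\lambda_t) = \min_{\lambda} L^\beta_{\tilde{\pi}}(\lambda). \]

Proof. Let us define the vectors $\lambda^+$ and $\lambda^-$ in terms of $\lambda$ as follows: for each j, if 
$\lambda_j \ge 0$ then $\lambda_j^+ = \lambda_j$ and $\lambda_j^- = 0$, and if $\lambda_j \le 0$ then $\lambda_j^+ = 0$ and $\lambda_j^- = -\lambda_j$. Vectors 
$\hat{\lambda}^+$, $\hat{\lambda}^-$, $\lambda_t^+$, $\lambda_t^-$, etc. are defined analogously. 

We begin by rewriting the function $F_j$. For any $\lambda$, $\delta$, we have that 

\[ |\lambda + \delta| - |\lambda| = \min \{ \delta^+ + \delta^- \mid \delta^+ \ge -\lambda^+, \ \delta^- \ge -\lambda^-, \ \delta^+ - \delta^- = \delta \}. \tag{14} \]

This can be seen by a simple case analysis on the signs of $\lambda$ and $\lambda + \delta$. Plugging into 
the definition of $F_j$ gives 

\[ F_j(\lambda, \delta) = \min \{ G_j(\lambda, \delta^+, \delta^-) \mid \delta^+ \ge -\lambda_j^+, \ \delta^- \ge -\lambda_j^-, \ \delta^+ - \delta^- = \delta \} \]

where 

\[ G_j(\lambda, \delta^+, \delta^-) = (\delta^- - \delta^+) \tilde{\pi}[f_j] + \ln \bigl( 1 + (e^{\delta^+ - \delta^-} - 1) q_\lambda[f_j] \bigr) + \beta_j (\delta^+ + \delta^-). \]

Combined with Eq. (12) and our choice of j and $\delta$, this gives that 

\begin{align}
L^\beta_{\tilde{\pi}}(\lambda_{t+1}) - L^\beta_{\tilde{\pi}}(\lambda_t) &\le \min_j \min_\delta F_j(\lambda_t, \delta) \notag \\
&= \min_j \min \{ G_j(\lambda_t, \delta^+, \delta^-) \mid \delta^+ \ge -\lambda_{t,j}^+, \ \delta^- \ge -\lambda_{t,j}^- \}. \tag{15}
\end{align}

Let $\min G(\lambda_t)$ denote this last expression. 

Since $G_j(\lambda, 0, 0) = 0$, it follows that $\min G(\lambda_t)$ is not positive and hence $L^\beta_{\tilde{\pi}}(\lambda_t)$ 
is nonincreasing in t. Since log loss is nonnegative, this means that 

\[ \sum_j \beta_j |\lambda_{t,j}| \le L^\beta_{\tilde{\pi}}(\lambda_1) < \infty. \]

Therefore, using our assumption that the $\beta_j$’s are strictly positive, we see that the $\lambda_t$’s 
must belong to a compact space. 

Since the $\lambda_t$’s come from a compact space, in Eq. (15) it suffices to consider updates $\delta^+$ 
and $\delta^-$ that come from a compact space themselves. Functions $G_j$ are uniformly continuous 
over these compact spaces, hence the function $\min G$ is continuous. 

The fact that the $\lambda_t$’s come from a compact space also implies that they must have a 
subsequence converging to some vector $\hat{\lambda}$. Clearly, $L^\beta_{\tilde{\pi}}$ is nonnegative, and we already 
noted that $L^\beta_{\tilde{\pi}}(\lambda_t)$ is nonincreasing. Therefore, $\lim_{t \to \infty} L^\beta_{\tilde{\pi}}(\lambda_t)$ exists and is equal, by 
continuity, to $L^\beta_{\tilde{\pi}}(\hat{\lambda})$. Moreover, the differences $L^\beta_{\tilde{\pi}}(\lambda_{t+1}) - L^\beta_{\tilde{\pi}}(\lambda_t)$ must be converging 
to zero, so $\min G(\lambda_t)$, which is nonpositive, also must be converging to zero by Eq. (15). 
By continuity, this means that $\min G(\hat{\lambda}) = 0$. In particular, for each j, we have 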

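\[ \min \{ G_j(\hat{\lambda}, \delta^+, \delta^-) \mid \delta^+ \ge -\hat{\lambda}_j^+, \ \delta^- \ge -\hat{\lambda}_j^- \} = 0. \tag{16} \]

We will complete the proof by showing that this equation implies that $\hat{\lambda}^+$ and $\hat{\lambda}^-$ together 
with $q_{\hat{\lambda}}$ satisfy the KKT (Kuhn-Tucker) conditions [14] for the convex program $\mathcal{P}'$, 
and thus form a solution to this optimization problem as well as to its dual $\mathcal{Q}'$, the 
minimization of $L^\beta_{\tilde{\pi}}$. For $p = q_{\hat{\lambda}}$, these conditions work out to be the following for all j: 

\begin{align}
\hat{\lambda}_j^+ \ge 0, \quad \tilde{\pi}[f_j] - q_{\hat{\lambda}}[f_j] \le \beta_j, \quad \hat{\lambda}_j^+ \bigl( \tilde{\pi}[f_j] - q_{\hat{\lambda}}[f_j] - \beta_j \bigr) = 0 \tag{17} \\
\hat{\lambda}_j^- \ge 0, \quad q_{\hat{\lambda}}[f_j] - \tilde{\pi}[f_j] \le \beta_j, \quad \hat{\lambda}_j^- \bigl( q_{\hat{\lambda}}[f_j] - \tilde{\pi}[f_j] - \beta_j \bigr) = 0. \tag{18}
\end{align}

Recall that $G_j(\hat{\lambda}, 0, 0) = 0$. Thus, by Eq. (16), if $\hat{\lambda}_j^+ > 0$ then $G_j(\hat{\lambda}, \delta^+, 0)$ is 
nonnegative in a neighborhood of $\delta^+ = 0$, and so has a local minimum at this point. 
That is, 

\[ 0 = \left. \frac{\partial G_j(\hat{\lambda}, \delta^+, 0)}{\partial \delta^+} \right|_{\delta^+ = 0} = -\tilde{\pi}[f_j] + q_{\hat{\lambda}}[f_j] + \beta_j. \]

If $\hat{\lambda}_j^+ = 0$, then Eq. (16) gives that $G_j(\hat{\lambda}, \delta^+, 0) \ge 0$ for $\delta^+ \ge 0$. Thus, $G_j(\hat{\lambda}, \delta^+, 0)$ 
cannot be decreasing at $\delta^+ = 0$. Therefore, the partial derivative evaluated above must 
be nonnegative. Together, these arguments exactly prove Eq. (17). Eq. (18) is proved 
analogously. 

Thus, we have proved that 

\[ \lim_{t \to \infty} L^\beta_{\tilde{\pi}}(\lambda_t) = L^\beta_{\tilde{\pi}}(\hat{\lambda}) = \min_{\lambda} L^\beta_{\tilde{\pi}}(\lambda). \qquad \square \]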



5 A Parallel-Update Algorithm 



Much of this paper has tried to be relevant to the case in which we are faced with a 
very large number of features. However, when the number of features is relatively small, 
it may be reasonable to minimize the regularized loss L?(A) using an algorithm that 
updates all features simultaneously on every iteration. There are quite a few algorithms 
that do this for the unregularized case, such as iterative scaling [4,6], gradient descent, 
Newton and quasi-Newton methods [11,16]. 

Williams [20] outlines how to modify any gradient based search to include $\ell_1$-style 
regularization. Kazama and Tsujii [10] use a gradient based method that imposes additional 
linear constraints to avoid discontinuities in the first derivative. Regularized 
variants of iterative scaling were proposed by Goodman [8], but without a complete 
proof of convergence. In this section, we describe a variant of iterative scaling with a 
proof of convergence. Note that the gradient based or Newton methods might be faster 
in practice. 

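Throughout this section, we make the assumption (without loss of generality) that, 
for all $x \in X$, $f_j(x) \ge 0$ and $\sum_j f_j(x) \le 1$. Like the algorithm of Section 4, our 
parallel-update algorithm is based on an approximation of the change in the objective 
function $L^\beta_{\tilde{\pi}}$, in this case the following, where $\lambda' = \lambda + \delta$: 

\begin{align}
L^\beta_{\tilde{\pi}}(\lambda') - L^\beta_{\tilde{\pi}}(\lambda)
&= \lambda \cdot \tilde{\pi}[f] - \lambda' \cdot \tilde{\pi}[f] + \ln Z_{\lambda'} - \ln Z_\lambda + \sum_j \beta_j (|\lambda'_j| - |\lambda_j|) \notag \\
&= -\delta \cdot \tilde{\pi}[f] + \ln q_\lambda[\exp(\delta \cdot f)] + \sum_j \beta_j (|\lambda_j + \delta_j| - |\lambda_j|) \tag{19} \\
&\le \sum_j \bigl[ -\delta_j \tilde{\pi}[f_j] + q_\lambda[f_j] (e^{\delta_j} - 1) + \beta_j (|\lambda_j + \delta_j| - |\lambda_j|) \bigr]. \tag{20}
\end{align}

Eq. (19) uses Eq. (13). For Eq. (20), note first that, if $x_j \in \mathbb{R}$ and $p_j \ge 0$ with $\sum_j p_j \le 1$ 
then 

\[ \exp \Bigl( \sum_j p_j x_j \Bigr) - 1 \le \sum_j p_j (e^{x_j} - 1). \]

(See Collins, Schapire and Singer [3] for a proof.) Thus, 

\begin{align*}
\ln q_\lambda \Bigl[ \exp \Bigl( \sum_j \delta_j f_j \Bigr) \Bigr]
&\le \ln q_\lambda \Bigl[ 1 + \sum_j f_j (e^{\delta_j} - 1) \Bigr] \\
&= \ln \Bigl( 1 + \sum_j q_\lambda[f_j] (e^{\delta_j} - 1) \Bigr) \\
&\le \sum_j q_\lambda[f_j] (e^{\delta_j} - 1)
\end{align*}

since $\ln(1 + x) \le x$ for all $x > -1$. 

Our algorithm, on each iteration, minimizes Eq. (20) over all choices of the $\delta_j$’s. With 
a case analysis on the sign of $\lambda_j + \delta_j$, and some calculus, we see that the minimizing $\delta_j$ 
must occur when $\delta_j = -\lambda_j$, or when $\delta_j$ is either 

\[
\ln \left( \frac{\tilde{\pi}[f_j] - \beta_j}{q_\lambda[f_j]} \right)
\quad \text{or} \quad
\ln \left( \frac{\tilde{\pi}[f_j] + \beta_j}{q_\lambda[f_j]} \right)
\]

where the first and second of these can be valid only if $\lambda_j + \delta_j \ge 0$ and $\lambda_j + \delta_j \le 0$, 
respectively. The full algorithm is shown in Figure 2. As before, we can prove the 
convergence of this algorithm when the $\beta_j$’s are strictly positive. 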
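The parallel update admits an equally short sketch. As before, this is an illustration under our own assumptions rather than the authors' code: the feature table, empirical distribution and regularization parameters are hypothetical, the features are scaled so that they sum to at most one at every point, the iteration count is fixed, and the zero step is included as a numerical safeguard.

```python
import math

def gibbs(lam, feats):
    """Gibbs distribution on a finite domain; feats[j][x] is feature j at point x."""
    w = [math.exp(sum(l * fj[x] for l, fj in zip(lam, feats)))
         for x in range(len(feats[0]))]
    z = sum(w)
    return [v / z for v in w], z

def reg_loss(lam, feats, emp_avg, beta):
    """Regularized log loss: -lam . emp_avg + ln Z + sum_j beta_j |lam_j|."""
    _, z = gibbs(lam, feats)
    return (-sum(l * a for l, a in zip(lam, emp_avg)) + math.log(z)
            + sum(b * abs(l) for b, l in zip(beta, lam)))

def parallel_delta(lam_j, emp, q_avg, beta_j):
    """Minimize the j-th term of the bound in Eq. (20) over the candidate steps."""
    def bound(d):
        return (-d * emp + q_avg * (math.exp(d) - 1)
                + beta_j * (abs(lam_j + d) - abs(lam_j)))
    candidates = [0.0, -lam_j]  # 0.0 is a safeguard added here, not in the paper
    if q_avg > 0:
        if emp - beta_j > 0:
            d = math.log((emp - beta_j) / q_avg)
            if lam_j + d >= 0:          # first case: new weight nonnegative
                candidates.append(d)
        if emp + beta_j > 0:
            d = math.log((emp + beta_j) / q_avg)
            if lam_j + d <= 0:          # second case: new weight nonpositive
                candidates.append(d)
    return min(candidates, key=bound)

def parallel_update(feats, emp_avg, beta, iterations=500):
    """Update every weight on every iteration."""
    lam = [0.0] * len(feats)
    for _ in range(iterations):
        q, _ = gibbs(lam, feats)
        q_avg = [sum(p * v for p, v in zip(q, fj)) for fj in feats]
        lam = [l + parallel_delta(l, e, qa, b)
               for l, e, qa, b in zip(lam, emp_avg, q_avg, beta)]
    return lam

# Hypothetical toy problem; the features sum to at most 1 at every point.
feats = [[0.0, 0.1, 0.45, 0.5], [0.5, 0.0, 0.0, 0.5]]
pi_tilde = [0.2, 0.3, 0.3, 0.2]
emp_avg = [sum(p * v for p, v in zip(pi_tilde, fj)) for fj in feats]
beta = [0.02, 0.02]

lam_hat = parallel_update(feats, emp_avg, beta)
q_hat, _ = gibbs(lam_hat, feats)
q_hat_avg = [sum(p * v for p, v in zip(q_hat, fj)) for fj in feats]
```

In this example the first feature's empirical average is already within tolerance of its expectation under the uniform distribution, so its weight stays at zero; only the second weight moves. This illustrates how the regularization term keeps weights at zero unless the data require otherwise.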



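Input: Finite domain X 
       features $f_1, \ldots, f_n$ where $f_j : X \to [0, 1]$ 
       and $\sum_j f_j(x) \le 1$ for all $x \in X$ 
       examples $x_1, \ldots, x_m \in X$ 
       nonnegative regularization parameters $\beta_1, \ldots, \beta_n$ 
Output: $\lambda_1, \lambda_2, \ldots$ minimizing $L^\beta_{\tilde{\pi}}(\lambda)$ 

Let $\lambda_1 = 0$ 
For $t = 1, 2, \ldots$: 
  - for each j, let $\delta_j = \arg\min_\delta \bigl( -\delta \tilde{\pi}[f_j] + q_{\lambda_t}[f_j] (e^\delta - 1) + \beta_j (|\lambda_{t,j} + \delta| - |\lambda_{t,j}|) \bigr)$ 
  - update $\lambda_{t+1} = \lambda_t + \delta$ 

Fig. 2. A parallel-update algorithm for optimizing the regularized log loss. 

Theorem 3. Assume all the $\beta_j$’s are strictly positive. Then the algorithm of Figure 2 
produces a sequence $\lambda_1, \lambda_2, \ldots$ for which 

\[ \lim_{t \to \infty} L^\beta_{\tilde{\pi}}(\lambda_t) = \min_{\lambda} L^\beta_{\tilde{\pi}}(\lambda). \]

Proof. The proof mostly follows the same lines as for Theorem 2. Here we sketch the 
main differences. 

Let us redefine $F_j$ and $G_j$ as follows: 

\[ F_j(\lambda, \delta) = -\delta \tilde{\pi}[f_j] + q_\lambda[f_j] (e^\delta - 1) + \beta_j (|\lambda_j + \delta| - |\lambda_j|) \]

and 

\[ G_j(\lambda, \delta^+, \delta^-) = (\delta^- - \delta^+) \tilde{\pi}[f_j] + q_\lambda[f_j] (e^{\delta^+ - \delta^-} - 1) + \beta_j (\delta^+ + \delta^-). \]

Then by Eq. (14), 

\[ F_j(\lambda, \delta) = \min \{ G_j(\lambda, \delta^+, \delta^-) \mid \delta^+ \ge -\lambda_j^+, \ \delta^- \ge -\lambda_j^-, \ \delta = \delta^+ - \delta^- \}. \]

So, by Eq. (20), 

\begin{align*}
L^\beta_{\tilde{\pi}}(\lambda_{t+1}) - L^\beta_{\tilde{\pi}}(\lambda_t)
&\le \min_\delta \sum_j F_j(\lambda_t, \delta_j) \\
&= \sum_j \min_{\delta_j} F_j(\lambda_t, \delta_j) \\
&= \sum_j \min \{ G_j(\lambda_t, \delta_j^+, \delta_j^-) \mid \delta_j^+ \ge -\lambda_{t,j}^+, \ \delta_j^- \ge -\lambda_{t,j}^- \}.
\end{align*}

Note that $G_j(\lambda, 0, 0) = 0$, so none of the terms in this sum can be positive. As in 
the proof of Theorem 2, the $\lambda_t$’s have a convergent subsequence converging to some $\hat{\lambda}$ 
for which 

\[ \sum_j \min \{ G_j(\hat{\lambda}, \delta_j^+, \delta_j^-) \mid \delta_j^+ \ge -\hat{\lambda}_j^+, \ \delta_j^- \ge -\hat{\lambda}_j^- \} = 0. \]

This fact, in turn, implies that $\hat{\lambda}^+$, $\hat{\lambda}^-$ and $q_{\hat{\lambda}}$ satisfy the KKT conditions for convex 
program $\mathcal{P}'$. This follows using the same arguments on the derivatives of $G_j$ as in 
Theorem 2. □ 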

6 Experiments 

In order to evaluate the effect of regularization on real data, we used maxent to model the 
distribution of some bird species, based on occurrence records in the North American 
Breeding Bird Survey [17]. Experiments described in this section overlap with the (much 
more extensive) experiments given in the companion paper [13]. 

We selected four species with a varying number of occurrence records: Hutton’s Vireo 
(198 occurrences), Blue-headed Vireo (973 occurrences), Yellow-throated Vireo (1611 
occurrences) and Loggerhead Shrike (1850 occurrences). The occurrence data of each 
species was divided into ten random partitions: in each partition, 50% of the occurrence 
localities were randomly selected for the training set, while the remaining 50% were set 
aside for testing. The environmental variables (coverages) use a North American grid 
with 0.2 degree square cells. We used seven coverages: elevation, aspect, slope, annual 
precipitation, number of wet days, average daily temperature and temperature range. 
The first three derive from a digital elevation model for North America [18], and the 
remaining four were interpolated from weather station readings [12]. Each coverage is 
defined over a 386 x 286 grid, of which 58,065 points have data for all coverages. 

Fig. 3. Learning curves. Log loss averaged over 10 partitions as a function of the number of 
training examples (m). Numbers of training examples are plotted on a logarithmic scale. 

In our experiments, we used threshold features derived from all environmental variables.
We reduced the $\beta_j$ to a single regularization parameter $\beta$ as follows. We expect
$|\pi[f_j] - \tilde\pi[f_j]| \approx \sigma[f_j]/\sqrt{m}$, where $\sigma[f_j]$ is the standard
deviation of $f_j$ under $\pi$. We therefore approximated $\sigma[f_j]$ by the sample deviation
$\tilde\sigma[f_j]$ and used $\beta_j = \beta\tilde\sigma[f_j]/\sqrt{m}$.
We believe that this method is more practical than the uniform convergence bounds from
Section 3, because it allows differentiation between features depending on empirical error
estimates computed from the sample data. In order to analyze this method, we could,
for instance, bound errors in standard deviation estimates using uniform convergence
results.
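The per-feature regularization widths described above can be sketched in a few lines of Python. This is only an illustration: the function name `regularization_widths` and the column-list input format are hypothetical, and the 1/m normalization of the sample deviation is an assumption, since the text does not say which convention was used.

```python
import math

def regularization_widths(feature_columns, beta=1.0):
    """Return beta_j = beta * sample_deviation(f_j) / sqrt(m) for each feature.

    feature_columns: one list of m sample values per feature f_j
    (a hypothetical input format chosen for this sketch).
    """
    widths = []
    for values in feature_columns:
        m = len(values)
        mean = sum(values) / m
        # Assumption: sample deviation with the 1/m convention.
        deviation = math.sqrt(sum((v - mean) ** 2 for v in values) / m)
        widths.append(beta * deviation / math.sqrt(m))
    return widths
```

A constant feature gets width 0 under this rule, while a feature that varies strongly across the sample gets a wider tolerance.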

We ran two types of experiments. First, we ran maxent on increasing subsets of the
training data and evaluated log loss on the test data. We took an average over ten partitions
and plotted the log loss as a function of the number of training examples. These plots
are referred to as learning curves. Second, we also varied the regularization parameter $\beta$
and plotted the log loss for fixed numbers of training examples as functions of $\beta$. These
curves are referred to as sensitivity curves. In addition to these curves, we give examples
of Gibbs distributions returned by maxent with and without regularization.

Fig. 3 shows learning curves for the four studied species. In all our runs we set
$\beta = 1.0$. This choice is justified by the sensitivity curve experiments described below.
In the absence of regularization, maxent would exactly fit the training data with delta
functions around sample values of the environmental variables. This would result in
severe overfitting even when the number of examples is large. As the learning curves
show, the regularized maxent does not exhibit this behavior, and finds better and better
distributions as the number of training examples increases.

In order to see how regularization facilitates learning, we examine the resulting
distributions. In Fig. 4, we show Gibbs distributions returned by a regularized and an
insufficiently regularized run of maxent on the first partition of the Yellow-throated Vireo.
To represent Gibbs distributions, we use feature profiles. For each environmental variable,
we plot the contribution to the exponent by all the derived threshold features as
a function of the value of the environmental variable. This contribution is just the sum
of step functions corresponding to threshold features weighted by the corresponding
lambdas. As we can see, the value of $\beta = 0.01$ only prevents components of $\lambda$ from
becoming arbitrarily large, but it does little to prevent heavy overfitting with many peaks
capturing single training examples. Raising $\beta$ to 1.0 completely eliminates these peaks.

Fig. 4. Feature profiles learned on the first partition of the Yellow-throated Vireo. For every
environmental variable, its additive contribution to the exponent of the Gibbs distribution is given
as a function of its value. Profiles for the two values of $\beta$ have been shifted for clarity; this
corresponds to adding a constant in the exponent. It has, however, no effect on the resulting model
since constants in the exponent cancel out with the normalization factor.

Fig. 5. Sensitivity curves (panels: Hutton's Vireo, Blue-headed Vireo, Yellow-throated Vireo,
Loggerhead Shrike; horizontal axis: regularization value $\beta$). Log loss averaged over 10
partitions as a function of $\beta$ for a varying number of training examples. For a fixed value
of $\beta$, maxent finds better solutions (with smaller log loss) as the number of examples grows.
We ran maxent with 10, 32, 100 and 316 training examples. Curves from top down correspond to
these numbers; curves for higher numbers are missing where fewer training examples were
available. Values of $\beta$ are plotted on a log scale.
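A feature profile of this kind is easy to compute once the thresholds and weights are known. The sketch below assumes each threshold feature is the indicator that the variable exceeds its threshold; the function name `feature_profile` and the argument layout are hypothetical, not taken from the maxent implementation.

```python
def feature_profile(thresholds, lambdas, value):
    """Contribution of one environmental variable to the Gibbs exponent.

    Assumption: each threshold feature is the indicator [value > threshold],
    so the profile is a sum of step functions weighted by the lambdas.
    """
    return sum(lam for thr, lam in zip(thresholds, lambdas) if value > thr)
```

Evaluating the profile on a fine grid of values produces the step-shaped curves of the kind plotted in Fig. 4; isolated large weights on nearby thresholds show up as narrow peaks.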

Fig. 5 shows the sensitivity of maxent to the regularization value $\beta$. Note that the
minimum log loss is achieved consistently around $\beta = 1.0$ for all studied species. This
suggests that for the purposes of maxent regularization, $\tilde\sigma[f_j]/\sqrt{m}$ are good
estimates of $|\tilde\pi[f_j] - \pi[f_j]|$ and that the maxent criterion models the underlying
distribution well, at least for threshold features. Log loss minima for other feature types
may be less consistent across different species [13].

Abstract. Learning probabilities (p-concepts [13]) and other real-valued
concepts (regression) is an important role of machine learning. For example,
a doctor may need to predict the probability of getting a disease
$P[y|x]$, which depends on a number of risk factors.

Generalized additive models [9] are a well-studied nonparametric model
in the statistics literature, usually with monotonic link functions. However,
no known efficient algorithms exist for learning such a general class.
We show that regression graphs efficiently learn such real-valued concepts,
while regression trees inefficiently learn them. One corollary is
that any function $E[y|x] = u(w \cdot x)$ for $u$ monotonic can be learned to
arbitrarily small squared error $\epsilon$ in time polynomial in $1/\epsilon$, $|w|_1$, and the
Lipschitz constant of $u$ (analogous to a margin). The model includes, as
special cases, linear and logistic regression, as well as learning a noisy
half-space with a margin [5,4].

Kearns, Mansour, and McAllester [12,15] analyzed decision trees and
decision graphs as boosting algorithms for classification accuracy. We
extend their analysis and the boosting analogy to the case of real-valued
predictors, where a small positive correlation coefficient can be boosted
to arbitrary accuracy. Viewed as a noisy boosting algorithm [3,10], the
algorithm learns both the target function and the asymmetric noise.



1 Introduction 

One aim of machine learning is predicting probabilities (such as p-concepts [13]) 
or general real values (regression). For example, Figure 1 illustrates the standard 
prediction of relapse probability for non-Hodgkin’s lymphoma, given a vector of 
patient features. In this application and many others, probabilities and real- 
valued estimates are more useful than simple classification. 

A powerful statistical model for regression is that of generalized linear models
[16], where the expected value of the dependent variable $y$ can be written as
$E[y|x] = u(w \cdot x)$, an arbitrary link function $u : \mathbb{R} \to \mathbb{R}$ of a linear function of
the feature vector $x \in \mathbb{R}^n$. Our results apply to mono-linear functions, where $u$
is monotonic and Lipschitz continuous.¹

Linear and logistic regression both learn mono-linear functions. The model
also captures (noisy) linear threshold functions with a margin [5,4].²

¹ A function $u$ is Lipschitz continuous with constant $L$ if $|u(a) - u(b)| \le L|a - b|$ for
all $a, b \in \mathbb{R}$. (For differentiable $u$, $|u'(a)| \le L$.)

² For a linear threshold function, $L = 1/\text{margin}$.



J. Shawe-Taylor and Y. Singer (Eds.): COLT 2004, LNAI 3120, pp. 487-501, 2004.
© Springer-Verlag Berlin Heidelberg 2004

# Risk    complete   relapse-free      relapse-free      2-year     5-year
Factors   response   2-year survival   5-year survival   survival   survival
          rate
0,1       87%        79%               70%               84%        73%
2         67%        66%               50%               66%        51%
3         55%        59%               49%               54%        43%
4,5       44%        58%               40%               34%        26%

Risk Factors: $x_1 > 60$, $x_2 > 2$, $x_3 > 2$, $x_4 >$ normal, and $x_5 > 3$.

($x_1$ = age, $x_2$ = # extranodal sites, $x_3$ = performance status, $x_4$ = LDH, $x_5$ = stage.)

Fig. 1. Non-Hodgkin's lymphoma International Prognostic Index probabilities [21].
Each probability (column) can be written in the form $u(I(x_1 > 60) + \ldots + I(x_5 > 3))$
for monotonic $u$, but does not fit a linear or logistic (or threshold) model.
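To make the form $u(\sum_i I(\cdot))$ concrete, the following Python sketch encodes the complete-response column of Fig. 1 as a lookup on the number of risk factors. The names `COMPLETE_RESPONSE`, `ipi_probability` and `is_monotonic` are hypothetical; the numbers are the ones in the table, with the grouped rows "0,1" and "4,5" expanded to both counts.

```python
# Complete-response rates from Fig. 1, indexed by the number of risk factors.
# Rows "0,1" and "4,5" of the table are shared by two counts each.
COMPLETE_RESPONSE = {0: 0.87, 1: 0.87, 2: 0.67, 3: 0.55, 4: 0.44, 5: 0.44}

def ipi_probability(risk_indicators, table=COMPLETE_RESPONSE):
    """Evaluate u(sum of 0/1 risk indicators) by table lookup."""
    return table[sum(risk_indicators)]

def is_monotonic(table):
    """Check that the link u is nonincreasing in the number of risk factors."""
    values = [table[k] for k in sorted(table)]
    return all(a >= b for a, b in zip(values, values[1:]))
```

Here the link $u$ is a nonincreasing step function of the risk-factor count, which is why the column is mono-linear yet not well described by a linear or logistic model.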



In fact, our results apply to the more general generalized additive models.
Random examples are seen from a distribution $\mathcal{D}$ over $X \times Y$, where $X = \mathbb{R}^n$ and
$Y \subseteq \mathbb{R}$. ($Y = \{0,1\}$ corresponds to probability learning [13].) The assumption
is that $f(x) = E[y|x] = u(\sum_i v_i(x_i))$, where $u$ is a continuous monotonic link
function and each $v_i : \mathbb{R} \to \mathbb{R}$ is an arbitrary function of bounded total variation³.

A regression tree is simply a decision tree with real (rather than binary)
predictions in the leaves. A decision graph (also called branching program,
DAG, or binary decision diagram) is a decision tree where internal nodes
may be merged. We suggest the natural regression graph, which is a decision
graph with real-valued predictions in the leaves (equivalently, a regression tree
with merging). We give an algorithm for learning these functions that is derivative
of Mansour and McAllester [15]. We show that, for error of $h$ defined as
$\epsilon(h) = E_{\mathcal{D}}[(h(x) - f(x))^2]$, the error of regression graphs decreases quickly,
while regression trees suffer from the "curse of dimensionality."

Theorem 1. Let $\mathcal{D}$ be a distribution on $X \times Y$, where $X \subseteq \mathbb{R}^n$ and $Y \subseteq [0,1]$.
Suppose $f(x) = E[y|x] = u(\sum_i v_i(x_i))$, where $u$ is monotonic (nondecreasing or
nonincreasing). Let $L$ be the Lipschitz constant of $u$ and let $V = \sum_i V_{v_i}$ be the sum
of the total variations of the $v_i$.

1. Natural top-down regression graph learning, with exact values of leaf weights
and leaf means, achieves $\epsilon(R) \le \epsilon$ with $\text{size}(R) \le L^3V^3/(10\epsilon^4)$.

2. For regression trees with exact values, $\epsilon(R) \le \epsilon$ with $\text{size}(R) \le 2(1.04)^{L^2V^2/\epsilon^3}$.
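The gap between the two bounds is easier to appreciate numerically. The sketch below evaluates the graph bound directly and the tree bound on a log scale, since the exponential overflows floating point quickly; the function names are hypothetical and the formulas are simply those stated in Theorem 1.

```python
import math

def graph_size_bound(L, V, eps):
    """Regression graph bound of Theorem 1: L^3 V^3 / (10 eps^4)."""
    return (L * V) ** 3 / (10 * eps ** 4)

def log_tree_size_bound(L, V, eps):
    """Natural log of the regression tree bound 2 * 1.04^(L^2 V^2 / eps^3)."""
    return math.log(2) + ((L * V) ** 2 / eps ** 3) * math.log(1.04)
```

For example, with $L = V = 1$ and $\epsilon = 0.1$ the graph bound is about 1000 nodes, while the tree bound is $2(1.04)^{1000}$, whose logarithm is roughly 40.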

While the above assumes knowing the exact values of parameters, standard tools
extend the analysis to the case of estimation, as described in Section 5.3. Also,
notice the Winnow-like dependence on $V$. In the case where each $v_i(x_i) = w_ix_i$
and $X = [0,1]^n$, $V = W = \sum_i |w_i|$. If $f(x)$ is a linear threshold function of
boolean $X = \{0,1\}^n$, and $w_i \in \mathbb{Z}$, then $V = W$ and $u$ can be chosen with $L = 1$,
since the increase from $u(z) = 0$ to $u(z) = 1$ happens between integer $z$'s. Since
the sample complexity depends only logarithmically on $n$, if there are only a
few relevant dimensions (with small $W$) then the algorithm will be very attribute
efficient.

³ The total variation of $v$ is how much "up and down" it goes. For differentiable
functions, it's $\int |v'(a)|\,da$. For monotonic functions it's $\sup_a v(a) - \inf_a v(a)$.



1.1 Real-Valued Boosting

In learning a regression graph or tree, one naturally searches for binary splits
of the form $x_i > \theta$. We first show that there always exists such a split with
positive correlation coefficient. We then show that a positive correlation leads to
a reduction in error.

This is clearly similar to boosting, and we extend the analyses of Kearns,
Mansour, and McAllester, who showed that decision trees and, more efficiently,
decision graphs can perform a type of boosting [20]. Rather than a weakly
accurate hypothesis (one with accuracy $P[h(x) = f(x)] > 1/2$), we use weakly
correlated hypotheses that have correlation bounded away from 0. This is similar to
the "okay" learners [10] designed for noisy classification.⁴

2 Related Work 

While generalized additive models have been studied extensively in statistics [9],
often with monotonic link functions, to the best of our knowledge no existing
algorithm can efficiently guarantee $\epsilon(h) \le \epsilon$ for arbitrarily small $\epsilon$, even though
such guarantees exist for much simpler single-variable problems.

For example, an algorithm for efficiently learning a monotonic function of a
single variable $x \in \mathbb{R}$, $f(x) = E[y|x]$, was given by Kearns and Schapire [13].
Statisticians also have efficient learning algorithms for this scatterplot smoothing
problem.

For the important special case of learning a linear threshold function with
classification noise, Bylander showed that Perceptron-like algorithms are efficient
in terms of a margin [5]. This would correspond to $u = \eta$ for negative
examples, $u = 1 - \eta$ for positive examples, and linearly increasing at a slope
of $(1 - 2\eta)/\text{margin}$ in between, where $\eta$ is the noise rate. Blum et al. removed
the dependence on the margin [4]. Bylander also proved efficient classification in
the case with a margin and random noise that monotonically and symmetrically
decreased in the margin. It would be very interesting if one could extend these
techniques to a non-symmetric noise rate, as symmetric techniques for other
problems, such as learning the intersection of half-spaces with a symmetric density
[1], have not been extended.



⁴ As observed in [10], correlation is arguably a more popular and natural measure
of weak association between two random variables than accuracy, e.g. the boolean
indicators $f(x)$ = "person $x$ lives in Chicago" and $h(x)$ = "person $x$ lives in Texas"
are negatively correlated, but have high accuracy $P[h(x) = f(x)]$.

3 Definitions

We use Kearns and Schapire's definition of efficient learnability in a real-valued
setting [13]. There is a distribution $\mathcal{D}$ over $X \times Y$. Kearns and Schapire
take binary labels $Y = \{0, 1\}$ in the spirit of learning probabilities and PAC
learning [22]. In the spirit of regression, we include real labels $Y \subseteq \mathbb{R}$, though
the theory is unchanged. The target function is $f(x) = E[y|x]$.

An algorithm $A$ learns concept class $C$ of real-valued functions from $X$ if, for
every $\epsilon, \delta > 0$ and every distribution $\mathcal{D}$ over $X \times Y$ such that $E_{\mathcal{D}}[y|x] = f(x) \in C$,
given access to random labelled examples from $\mathcal{D}$, with probability $1 - \delta$, $A$
outputs hypothesis $h$ with error

$$\epsilon(h) = E_{\mathcal{D}}[(h(x) - f(x))^2] \le \epsilon.$$

It efficiently learns if it runs in time polynomial in $1/\epsilon$, $1/\delta$, and size($f$).⁵

While $\epsilon(h)$ cannot directly be estimated, $E[(h(x) - y)^2]$ can be and is related:

$$E_{\mathcal{D}}[(h(x) - y)^2] = E_{\mathcal{D}}[(f(x) - y)^2] + E_{\mathcal{D}}[(h(x) - f(x))^2].$$
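The decomposition above can be checked exactly on a small finite distribution. The sketch below is illustrative only: the function name `squared_error_terms` and the dictionary encoding of the joint distribution are hypothetical choices, not part of the paper.

```python
def squared_error_terms(dist, h):
    """Return (E[(h-y)^2], E[(f-y)^2], E[(h-f)^2]) for a finite distribution.

    dist maps (x, y) pairs to probabilities; f(x) = E[y | x] is computed
    from dist itself.
    """
    px, pxy = {}, {}
    for (x, y), p in dist.items():
        px[x] = px.get(x, 0.0) + p
        pxy[x] = pxy.get(x, 0.0) + p * y
    f = {x: pxy[x] / px[x] for x in px}  # conditional mean of y given x

    def expect(g):
        return sum(p * g(x, y) for (x, y), p in dist.items())

    total = expect(lambda x, y: (h(x) - y) ** 2)
    noise = expect(lambda x, y: (f[x] - y) ** 2)
    error = expect(lambda x, y: (h(x) - f[x]) ** 2)
    return total, noise, error
```

The first returned value equals the sum of the other two for any hypothesis `h`, because the cross term vanishes when `f` is the conditional mean. Since the noise term does not depend on `h`, minimizing the observable squared loss also minimizes $\epsilon(h)$.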

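Let the indicator function $I(P) = 1$ if predicate $P$ holds and 0 otherwise.
Recall various statistical definitions for random variables $u, v \in \mathbb{R}$.

$$\mu_u = E[u]$$
$$\text{cov}(u,v) = E[(u - \mu_u)(v - \mu_v)] = E[uv] - \mu_u\mu_v$$
$$\text{var}(u) = \sigma_u^2 = \text{cov}(u,u) = E[(u - \mu_u)^2] = E[u^2] - \mu_u^2$$
$$\sigma_u = \sqrt{\text{var}(u)}$$
$$\text{cor}(u,v) = \rho_{uv} = \frac{\text{cov}(u,v)}{\sigma_u\sigma_v}$$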

In most of the analysis, the random variables $f, h : X \to \mathbb{R}$ can either be thought
of as functions or the induced random variables for $x$ from $\mathcal{D}$. We use $\rho_{fh}$ or
$\rho_{f(x)h(x)}$, as is convenient. We will use a few properties of covariance. It is shift
invariant, i.e. $\text{cov}(u + c, v) = \text{cov}(u, v)$ for a constant $c$. It is symmetric and
bilinear, i.e.

$$\text{cov}(c_1u_1 + c_2u_2, v) = c_1\text{cov}(u_1, v) + c_2\text{cov}(u_2, v),$$

for constants $c_1, c_2$.

The (possibly infinite) Lipschitz constant of a function $u : \mathbb{R} \to \mathbb{R}$ is

$$L = \sup_{a \ne b} \frac{|u(a) - u(b)|}{|a - b|}.$$



Let $V_g$ be the total variation of a function $g : \mathbb{R} \to \mathbb{R}$, which can be defined as
the following supremum over all increasing sequences of $a_i \in \mathbb{R}$.

$$V_g = \sup_{k} \sup_{a_1 < a_2 < \cdots < a_k} \sum_{i=1}^{k-1} |g(a_{i+1}) - g(a_i)|.$$
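For a function sampled at an increasing sequence of points, the inner sum in this definition is just the sum of absolute successive differences. The helper below is a minimal sketch under that reading; the name `total_variation` is hypothetical, and for a finite sample it gives a lower bound on $V_g$ rather than the supremum itself.

```python
def total_variation(values):
    """Sum of |g(a_{i+1}) - g(a_i)| over consecutive sampled values.

    values: g evaluated at an increasing sequence of points.
    """
    return sum(abs(b - a) for a, b in zip(values, values[1:]))
```

For a monotonic sequence the result collapses to the difference between the largest and smallest value, matching the monotonic case mentioned in the total-variation footnote.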

⁵ In our example size($f$) = $LV$, where $L$ is a Lipschitz constant and $V$ is total variation.

4 Top-Down Regression Graph Learning

For our purposes, a regression tree $R$ is a binary tree with boolean split predicates,
functions from $X$ to $\{0, 1\}$, at each internal node. The leaves are annotated
with real numbers. A regression graph $R$ is just a regression tree with merges.
More specifically, it's a directed acyclic graph where each internal node again
has a boolean split predicate and two labelled outgoing edges, but children may
be shared across many parents. The internal nodes determine a partition of $X$
into the leaves. The weight of a leaf $\ell$ is $w_\ell = P[x \in \ell]$. The value of a leaf $\ell$ is
$q_\ell = E[y | x \in \ell]$. We define the prediction $R(x)$ to be the value of the leaf that
$x$ falls into. (These quantities are exact; estimation is discussed in the next section.)
This enables us to discuss the correlation coefficient and other quantities
relating to $R$. We also define the distribution $\mathcal{D}_\ell$, which is the distribution $\mathcal{D}$
restricted to the leaf $\ell$.

It is straightforward to verify that $\mu_y = \mu_f = \mu_R = \sum_\ell w_\ell q_\ell$. Most decision
tree algorithms work with a potential function, such as $\epsilon(R) = E[(R(x) - f(x))^2]$,
and make each local choice based on which one decreases the potential most. In
Appendix C, we show that all of the following potential functions yield the same
ordering on graphs:

$$-\rho_{Rf}, \quad -\sigma_R^2, \quad -\sum_\ell w_\ell (a(q_\ell - b))^2, \quad \sum_\ell w_\ell\, 4q_\ell(1 - q_\ell)$$

We use the second one, $G(R) = -\sum_\ell w_\ell q_\ell^2$, because it is succinct in terms of $w_\ell, q_\ell$.
However, the $(a, b)$ formulation (for $a \ne 0$, $b \in \mathbb{R}$) illustrates that minimizing
$G(R)$ is scale-invariant (and shift-invariant), which means that the algorithm can
be run as-is even if $Y$ is larger than $[0, 1]$ (and the guarantees scale accordingly).
Also, the last quantity shows that it is equivalent to the Gini splitting criterion
used by CART [6].

A natural top-down regression graph learning algorithm with stopping parameter
$\Delta_{\min}$ is as follows. We start with a single leaf and repeat:

1. Sort leaves so that $q_{\ell_1} \le q_{\ell_2} \le \cdots \le q_{\ell_N}$. ($N$ = # of leaves.)

2. Merge leaves $\ell_a, \ell_{a+1}, \ldots, \ell_b$ into a single internal node. Split this node into
two leaves with a split of the form $(x_i \le \theta)$. Choose $\theta \in \mathbb{R}$, $i \in \mathbb{Z}$, and
$1 \le a \le b \le N$ that minimize $G(R)$.

3. Repeat until the change in $G(R)$ is less than $\Delta_{\min}$.

Every author seems to have their own suggestion about which nodes to merge.
Our merging rule above is in the spirit of decision trees. Several rules have been
proposed [15,14,18,7,2,19], including some that are bottom-up. Mansour and
McAllester's algorithm [15] is more computationally efficient than ours, has the
same sample complexity guarantees, but requires fixed-width buckets of leaves.
The regression tree learner is the same without merges, i.e. $a = b$. The size($R$)
is defined to be the number of nodes.

The following lemma serves the same purpose as Lemma 5 of [12] (using
correlation rather than classification error).

Lemma 1. Let $h : X \to \{0,1\}$ be a binary function. The split of $\ell$ into leaves
$\ell_0 = \{x \in \ell \mid h(x) = 0\}$ and $\ell_1 = \{x \in \ell \mid h(x) = 1\}$ has score (reduction in $G(R)$)
of $\frac{w_{\ell_0}w_{\ell_1}}{w_\ell}(q_{\ell_0} - q_{\ell_1})^2 = w_\ell\,(\text{cor}_{\mathcal{D}_\ell}(f,h))^2\,\text{var}_{\mathcal{D}_\ell}(f)$.
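The two expressions for the split score in Lemma 1 can be compared numerically on a single leaf. In the sketch below, a leaf is a hypothetical list of `(probability, f_value, h_value)` triples whose probabilities sum to the leaf weight; both function names are illustrative and not taken from the paper.

```python
def split_score(points):
    """Score w0*w1/w * (q0 - q1)^2 of splitting a leaf by a boolean h.

    points: list of (probability, f_value, h_value) triples, h_value in {0, 1}.
    Assumes both sides of the split have positive weight.
    """
    w = sum(p for p, _, _ in points)
    w1 = sum(p for p, _, h in points if h == 1)
    w0 = w - w1
    q0 = sum(p * f for p, f, h in points if h == 0) / w0
    q1 = sum(p * f for p, f, h in points if h == 1) / w1
    return w0 * w1 / w * (q0 - q1) ** 2

def correlation_score(points):
    """Score w * cor(f, h)^2 * var(f), measured inside the leaf.

    Assumes f is not constant on the leaf, so var(f) > 0.
    """
    w = sum(p for p, _, _ in points)
    mean_f = sum(p * f for p, f, _ in points) / w
    mean_h = sum(p * h for p, _, h in points) / w
    cov = sum(p * f * h for p, f, h in points) / w - mean_f * mean_h
    var_f = sum(p * f * f for p, f, _ in points) / w - mean_f ** 2
    var_h = mean_h * (1 - mean_h)  # h is boolean
    cor = cov / (var_f * var_h) ** 0.5
    return w * cor ** 2 * var_f
```

A top-down learner can therefore rank candidate splits using only leaf weights and leaf means, while the analysis reasons about the same quantity through correlation.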

The proof is in Appendix A. We move the buckets of Mansour and McAllester
[15] into our analysis, like [10].

Lemma 2. The merger of leaves $\ell_a, \ell_{a+1}, \ldots, \ell_b$ with $q_{\ell_a} \le \ldots \le q_{\ell_b}$ into a
single leaf can increase $G(R)$ by at most $(w_{\ell_a} + \ldots + w_{\ell_b})(q_{\ell_b} - q_{\ell_a})^2$.

Proof. Proof by induction on $b$. The case $b = a$ is trivial. Let $\ell_{<b} = \ell_a \cup \ldots \cup \ell_{b-1}$
be the merger of all leaves except $\ell_b$. Then clearly $q_{\ell_a} \le q_{\ell_{<b}} \le q_{\ell_b}$. In terms of
change in $G(R)$, the merger of $\ell_b$ and $\ell_{<b}$ is exactly the opposite of a split, and
thus by Lemma 1, it increases $G(R)$ by an additional

$$\frac{w_{\ell_b}w_{\ell_{<b}}}{w_{\ell_b} + w_{\ell_{<b}}}(q_{\ell_b} - q_{\ell_{<b}})^2 \le w_{\ell_b}(q_{\ell_b} - q_{\ell_a})^2.$$



5 Mono-linear and Mono-additive Learning 

Lemma 4 will show that for any mono-linear or mono-additive function, there is
a threshold of a single attribute that has sufficiently large covariance with the
target function. Then, using Lemmas 1 and 2 above, Lemma 5 shows that $\epsilon(R)$
will become arbitrarily small.



5.1 Existence of a Correlated Split 

Lemma 3. Let $u : \mathbb{R} \to \mathbb{R}$ be a monotonically nondecreasing $L$-Lipschitz function.
Then for any distribution over $z \in \mathbb{R}$, $\text{cov}(u(z), z) \ge \sigma_u^2/L$.

Proof. By the bilinearity of covariance, and since $\sigma_u^2 = \text{cov}(u,u)$, the statement
of the lemma can be rewritten as $\text{cov}(u,t) \ge 0$ for $t(z) = z - u(z)/L$. Note that
$t(z)$ is nondecreasing as well. To see this, $t(z) - t(z') = z - z' - (u(z) - u(z'))/L$,
which is nonnegative for $z \ge z'$, by definition of $L$-Lipschitz.

Now imagine picking $\tilde z$ independently from the same distribution as $z$. Then,
since $\text{sign}(u(z) - u(\tilde z)) = \text{sign}(t(z) - t(\tilde z))$ always,

$$E[(u(z) - u(\tilde z))(t(z) - t(\tilde z))] \ge 0$$
$$E[u(z)t(z)] + E[u(\tilde z)t(\tilde z)] - E[u(z)t(\tilde z)] - E[u(\tilde z)t(z)] \ge 0$$
$$2E[u(z)t(z)] - 2E[u(z)]E[t(z)] \ge 0$$

The last line follows from independence and is equivalent to $\text{cov}(u,t) \ge 0$. □
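As a sanity check of Lemma 3, the sketch below computes both sides of the inequality for a logistic link on a small discrete distribution. The logistic function has Lipschitz constant 1/4; the helper names and the particular distribution are hypothetical choices made for illustration.

```python
import math

def logistic(z):
    """A nondecreasing link with Lipschitz constant 1/4."""
    return 1.0 / (1.0 + math.exp(-z))

def cov_and_variance(zs, probs, u):
    """Return (cov(u(z), z), var(u(z))) for a finite distribution over z."""
    mean_z = sum(p * z for p, z in zip(probs, zs))
    mean_u = sum(p * u(z) for p, z in zip(probs, zs))
    cov = sum(p * u(z) * z for p, z in zip(probs, zs)) - mean_u * mean_z
    var_u = sum(p * u(z) ** 2 for p, z in zip(probs, zs)) - mean_u ** 2
    return cov, var_u
```

With `zs = [-2, -1, 0, 0.5, 3]` and `probs = [0.1, 0.2, 0.3, 0.3, 0.1]`, the covariance is about 0.277 while $\sigma_u^2/L$ is about 0.201, consistent with the lemma.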

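Lemma 4. Let $f : \mathbb{R}^n \to \mathbb{R}$ be of the form $f(x) = u(\sum_i v_i(x_i))$, where $u$ is
monotonic and $L$-Lipschitz, each $v_i : \mathbb{R} \to \mathbb{R}$ is a function of bounded variation
$V_{v_i}$, and $V = \sum_i V_{v_i}$. Then there exists $i \in \{1, 2, \ldots, n\}$, $\circ \in \{<, >, \le, \ge\}$, and
$\theta \in \mathbb{R}$, such that

$$\text{cov}(I(x_i \circ \theta), f) \ge \frac{\sigma_f^2}{LV}.$$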




Learning Monotonic Linear Functions 493 



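Proof. WLOG $u$ is monotonically nondecreasing. A theorem from real analysis
states that every function $v$ of bounded variation $V_v$ can be written as the sum
of a monotonically nondecreasing function $v_1$ and a monotonically nonincreasing
function $v_2$ with $V_v = V_{v_1} + V_{v_2}$ [17]. Thus, we can write

$$\sum_{i=1}^{n} v_i(x_i) = \sum_{i=1}^{n} v_{i1}(x_i) + v_{i2}(x_i),$$

for monotonic $v_{ij}$, and $V = \sum_{i=1}^{n}\sum_{j=1}^{2} V_{v_{ij}}$. Let $c_{ij} = \inf_{x_i} v_{ij}(x_i)$ (so
$v_{ij} : \mathbb{R} \to [c_{ij}, c_{ij} + V_{v_{ij}}]$).

Now we argue that a random threshold function of a random attribute will
have large covariance. Observe that for any $z \in [0,1]$, $E_{\alpha}[I(z \ge \alpha)] = z$,
where $\alpha$ is uniform over $[0, 1]$. Then, since $(v_{ij}(x_i) - c_{ij})/V_{v_{ij}} \in [0, 1]$,

$$\frac{v_{ij}(x_i) - c_{ij}}{V_{v_{ij}}} = E_{\alpha \in [0,1]}\left[I\left(\frac{v_{ij}(x_i) - c_{ij}}{V_{v_{ij}}} \ge \alpha\right)\right] = E_{\alpha \in [0,1]}\left[I(v_{ij}(x_i) \ge c_{ij} + \alpha V_{v_{ij}})\right].$$

Choose $i, j$ from the distribution $P(i,j) = V_{v_{ij}}/V$. Then,

$$E_{i,j,\alpha}\left[I(v_{ij}(x_i) \ge c_{ij} + \alpha V_{v_{ij}})\right] = \sum_{j=1}^{2}\sum_{i=1}^{n} \frac{V_{v_{ij}}}{V} E_{\alpha}\left[I(v_{ij}(x_i) \ge c_{ij} + \alpha V_{v_{ij}})\right] = \sum_{j=1}^{2}\sum_{i=1}^{n} \frac{v_{ij}(x_i) - c_{ij}}{V} = \frac{1}{V}\sum_{i=1}^{n} v_i(x_i) - c,$$

for some constant $c \in \mathbb{R}$. By the bilinearity of covariance, the above, and the
fact that covariance is immune to shifts,

$$E_{i,j,\alpha}\left[\text{cov}(f, I(v_{ij}(x_i) \ge c_{ij} + \alpha V_{v_{ij}}))\right] = \text{cov}\left(f, E_{i,j,\alpha}\left[I(v_{ij}(x_i) \ge c_{ij} + \alpha V_{v_{ij}})\right]\right) = \text{cov}\left(f, \frac{1}{V}\sum_i v_i(x_i) - c\right) = \frac{\text{cov}(f, \sum_i v_i(x_i))}{V}.$$

From the previous lemma, the last quantity is at least $\sigma_f^2/(LV)$. Since the above
holds in expectation, there must be an $i, j$, and $\alpha$ for which it holds instantaneously.
Finally, since $v_{ij}$ is monotonic, $I(v_{ij}(x_i) \ge c_{ij} + \alpha V_{v_{ij}}) = I(x_i \circ \theta)$ for
some $\circ \in \{<, >, \le, \ge\}$ and $\theta \in \mathbb{R}$. □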
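The existence claim of Lemma 4 can be explored by brute force on a small grid. The sketch below assumes $x$ is uniform on a finite grid in each coordinate, searches all single-attribute threshold splits, and returns the best covariance next to $\sigma_f^2$; the function names, the grid and the example $v_i$ are hypothetical. Total variation is measured on the grid points only, which is an assumption of this illustration.

```python
import itertools
import math

def best_threshold_covariance(grid, v_funcs, u):
    """Search splits I(x_i >= theta) and I(x_i <= theta), x uniform on grid^n.

    Returns (largest covariance with f found, variance of f).
    """
    n = len(v_funcs)
    points = list(itertools.product(grid, repeat=n))
    p = 1.0 / len(points)
    f = [u(sum(v(xi) for v, xi in zip(v_funcs, pt))) for pt in points]
    mean_f = p * sum(f)
    var_f = p * sum((val - mean_f) ** 2 for val in f)
    best = 0.0
    for i in range(n):
        for theta in grid:
            for predicate in (lambda x: x >= theta, lambda x: x <= theta):
                h = [1.0 if predicate(pt[i]) else 0.0 for pt in points]
                mean_h = p * sum(h)
                cov = p * sum(a * b for a, b in zip(h, f)) - mean_h * mean_f
                best = max(best, cov)
    return best, var_f

def grid_total_variation(grid, v):
    """Total variation of v measured on the (sorted) grid points."""
    values = [v(x) for x in grid]
    return sum(abs(b - a) for a, b in zip(values, values[1:]))

def shifted_logistic(z):
    """Example link: nondecreasing, Lipschitz constant 1/4."""
    return 1.0 / (1.0 + math.exp(-(z - 2.0)))
```

For instance, with `grid = [0, 1, 2, 3]`, `v_funcs = [lambda x: 0.5 * x, lambda x: abs(x - 1.5)]` and the link above, the total variation on the grid is 3.5 and the best split found is $x_1 \ge 2$, whose covariance exceeds $\sigma_f^2/(LV)$ with $L = 1/4$.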

The dependence on $\sigma_f$ in the above lemma is necessary. If $\sigma_f = 0$, then
$\text{cov}(h, f)$ must also be 0. But the lemma does give us the following guarantee
on correlation in terms of $\sigma_f$,

$$\rho_{hf} = \frac{\text{cov}(h, f)}{\sigma_h\sigma_f} \ge \frac{\sigma_f}{\sigma_h LV} \ge \frac{4\sigma_f}{LV}. \qquad (1)$$

5.2 The Implications for $\epsilon(R)$

Anticipating some kind of correlation boosting, we state the following lemma in
terms of a guaranteed correlation $\rho(\sigma_f^2)$. In the above case $\rho(z) = 4\sqrt{z}/(LV)$.

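Lemma 5. Suppose $\rho : \mathbb{R} \to \mathbb{R}^+$ is a nondecreasing guarantee function such
that, for each leaf $\ell$, there exists a split predicate $h : X \to \{0, 1\}$ of correlation
$\text{cor}_{\mathcal{D}_\ell}(h, f) \ge \rho(\text{var}_{\mathcal{D}_\ell}(f))$. Suppose $f(x) \in [0,1]$. Then with the regression
graph learner with $\Delta_{\min} = \epsilon^{2.5}(\rho(\epsilon/2))^3/4$, error $\epsilon(R) \le \epsilon$ with at most
$\epsilon^{-2.5}(\rho(\epsilon/2))^{-3}$ splits. For the regression tree learner, after $\exp(1/(4(\rho(\epsilon/2))^2\epsilon^2))$
splits, $\epsilon(R) \le \epsilon$.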

Proof. By definition of leaf variance $\text{var}_{\mathcal{D}_\ell}(f)$ and error $\epsilon(R)$,

$$\epsilon(R) = E_{\mathcal{D}}[(R(x) - f(x))^2] = \sum_\ell w_\ell E_{\mathcal{D}_\ell}[(q_\ell - f(x))^2] = \sum_\ell w_\ell\,\text{var}_{\mathcal{D}_\ell}(f).$$

Let $N$ be the current number of leaves. As long as $\epsilon(R) > \epsilon$, there must be some
leaf $\ell$ with both $w_\ell \ge 2\epsilon/N$ and $\text{var}_{\mathcal{D}_\ell}(f) \ge \epsilon/2$. Otherwise, the contribution
to $\epsilon(R)$ from leaves with $\text{var}_{\mathcal{D}_\ell}(f) < \epsilon/2$ would be $< \epsilon/2$ and from the rest
of leaves would be at most $N(2\epsilon/N)(1/4) = \epsilon/2$, since $\text{var}_{\mathcal{D}_\ell}(f) \le 1/4$ (since
$f(x) \in [0,1]$).

By Lemma 1, using $\text{cor}_{\mathcal{D}_\ell}(f, h) \ge \rho(\epsilon/2)$ correlation, splitting this leaf $\ell$ gives
a reduction in $G(R)$ of at least

$$\Delta G \ge w_\ell(\rho(\epsilon/2))^2\,\text{var}_{\mathcal{D}_\ell}(f) \ge (\rho(\epsilon/2))^2\epsilon^2/N.$$

Now $\epsilon(R) = \sigma_f^2$ at the start and decreases in each step, but never goes
below 0. Also, the change in $G(R)$ is equal to the change in $\epsilon(R)$ since $\epsilon(R) =
G(R) + E_{\mathcal{D}}[f(x)^2]$. Thus the total change in $G(R)$ is at most $\sigma_f^2 \le 1/4$. In the
case of regression trees, where we do splits and no merges, each split increases
the number of leaves by 1. Thus, after $T$ splits,

$$\sum_{N=1}^{T} \frac{(\rho(\epsilon/2))^2\epsilon^2}{N} \le \frac{1}{4}.$$

Since $\sum_{N=1}^{T} 1/N \ge \ln(T)$, we get the regression tree half of the lemma.

For regression graphs, say at some point there are $N$ leaves with values
$q_\ell \in [0,1]$. Now bucket the leaves by value of $q_\ell$ into $1/s$ intervals of width
$s = \rho(\epsilon/2)\sqrt{\epsilon}/2$. For the moment, imagine merging all leaves in every bucket.
Then there would be at most $1/s$ leaves, and by the above reasoning, there
must be one of these merged leaves $\ell = \ell_a \cup \ell_{a+1} \cup \ldots \cup \ell_b$ with $w_\ell \ge 2\epsilon s$ and
$\text{var}_{\mathcal{D}_\ell}(f) \ge \epsilon/2$ (the error $\epsilon(R)$ can only have increased due to the merger). Now
imagine merging only the leaves in this bucket and not any of the others. By
Lemma 2, the increase in $G(R)$ due to the merger is at most $w_\ell(q_{\ell_b} - q_{\ell_a})^2 \le w_\ell s^2$.

Using Lemma 1 as well, the total decrease in $G(R)$ is at least

$$\begin{aligned}
\Delta G &\ge w_\ell(\rho(\epsilon/2))^2\,\text{var}_{\mathcal{D}_\ell}(f) - w_\ell s^2 \\
&\ge w_\ell\left[(\rho(\epsilon/2))^2\epsilon/2 - (\rho(\epsilon/2))^2\epsilon/4\right] \\
&\ge (2\epsilon s)(\rho(\epsilon/2))^2\epsilon/4 \\
&= \epsilon^{2.5}(\rho(\epsilon/2))^3/4.
\end{aligned}$$

Thus there exists a merge-split that reduces $G(R)$ by at least $\epsilon^{2.5}(\rho(\epsilon/2))^3/4$ as
long as $\epsilon(R) > \epsilon$, and by choice of $\Delta_{\min}$ we will not stop prematurely. Using
that the total reduction in $G(R)$ is at most 1/4 completes the lemma. □

We are now ready to prove the main theorem.

Proof (of Theorem 1). For part 1, we run the regression graph learning algorithm
(getting exact values of $w_\ell$ and $q_\ell$). By (1), we have $\rho(z) = 4\sqrt{z}/(LV)$. Since size($R$)
increases at most 2 per split, by Lemma 5, $\epsilon(R) \le \epsilon$ with

$$\text{size}(R) \le 2\epsilon^{-2.5}(\rho(\epsilon/2))^{-3} = \epsilon^{-4}(LV)^3/(8\sqrt{2}) \le \epsilon^{-4}(LV)^3/10.$$

We use $\Delta_{\min} = 4\sqrt{2}\epsilon^4/(LV)^3$ to guarantee we get this far and don't run too
long. Similarly, for regression trees in part 2, by Lemma 5, since $(\rho(\epsilon/2))^2 =
8\epsilon/(LV)^2$, $\text{size}(R) \le 2\exp((LV)^2/\epsilon^3/32)$. Finally, $e^{1/32} \le 1.04$. □

5.3 Estimations Via Sampling

Of course, we don't have exact values of $g_\ell = w_\ell q_\ell^2$ for each leaf, so one must use
estimates. For simplicity of analysis, we use fresh samples to estimate this quantity
(the only quantity necessary) for each leaf. (Though a more sophisticated
argument could be used, since the VC dimension of splits is small, to argue that
one large sample is enough.) It is not difficult to argue that if each estimate of
$g_\ell$, for each potential leaf $\ell$ encountered, is accurate to within, say $\tau = \Delta_{\min}/10$,
the algorithm will still have the same asymptotic guarantees.

While it is straightforward to estimate $w_\ell$ to within fixed additive tolerance,
estimating $q_\ell$ to within fixed additive tolerance is not necessarily easy when $w_\ell$
is small. However, if $w_\ell$ is very small, then $g_\ell$ is also small. More precisely, if
$w_\ell < \tau/2$ and the estimate is accurate to within tolerance $\tau/10$, then we can
safely estimate $g_\ell = 0$ and still be accurate to within $\tau$. On the other hand, if
$w_\ell \ge \tau$, then it takes only $1/\tau$ samples to get one from leaf $\ell$, and we can
estimate $q_\ell$ to additive accuracy $\tau/10$ and thus $g_\ell$ to additive accuracy $\tau$.

To have failure probability $\delta$, the number of samples required depends
polynomially on $1/\epsilon$, $\log(n/\delta)$, and size($R$). The poly-log($n$) dependence on $n$
can be good in situations where there are only a few relevant attributes and $W$
is small.

6 Correlation Boosting 

Lemma 5 is clearly hiding a statement about boosting. Recall that in classification
boosting, a weak learner is basically an algorithm that outputs a boolean
hypothesis $h$ with accuracy $P[h(x) = f(x)] \ge 1/2 + \gamma$ (for any distribution),
where $1/\gamma$ is polynomial in size($f$). Then the result was that the accuracy could
be "boosted" to $1 - \epsilon$ in time poly($1/\epsilon$, size($f$)). We follow the same path,
replacing accuracy with correlation. We define a weak correlator, also similar to
an "okay" learner [10].



Definition 1. Let $\rho : [0, 1] \to [0, 1]$ be a nondecreasing function. An efficient $\rho$
weak correlator for concept class $C$ is an algorithm (that takes inputs $\delta$ and samples
from $\mathcal{D}$) such that, for any $\delta > 0$, any distribution $\mathcal{D}$ over $X \times Y$ with $Y = [0, 1]$
and $f(x) = E[y|x] \in C$, with probability $1 - \delta$ it outputs a hypothesis $h : X \to \mathbb{R}$
with $\rho_{fh} \ge \rho(\sigma_f^2)$. It must run in time polynomial in $1/\delta$, $1/\sigma_f$, and size($f$), and
$1/\rho$ must be polynomial in $1/\sigma_f$ and size($f$).

The algorithm is very similar. We start with a single leaf $\ell$. Repeat:

1. Sort leaves so that $q_{\ell_1} \le q_{\ell_2} \le \cdots \le q_{\ell_N}$. ($N$ = # of leaves.)

2. For each run of leaves $\ell_a, \ell_{a+1}, \ldots, \ell_b$, run the weak correlator (for a maximum
of $T$ time) on the distribution $\mathcal{D}_{\ell_{ab}}$, where $\ell_{ab}$ would be the merger of
$\ell_a, \ldots, \ell_b$. If it terminates, the output will be some predictor $h_{ab} : X \to \mathbb{R}$.
Choose $1 \le a \le b \le N$ and $\theta$ such that the merge-split of $\ell_a, \ldots, \ell_b$ with split
$(h_{ab}(x) > \theta)$ gives the smallest $G(R)$.

3. Repeat until the change in $G(R)$ is less than $\Delta_{\min}$.



The point is that such a weak correlator can be used to get an arbitrarily
accurate regression graph $R$ with $\epsilon(R) \le \epsilon$ for any $\epsilon > 0$ (efficiently in $1/\epsilon$).
Appendix C shows

$$\rho_{Rf} = \sqrt{1 - \frac{\epsilon(R)}{\sigma_f^2}} \ge 1 - \frac{\epsilon(R)}{\sigma_f^2}.$$

Thus, reducing $\epsilon(R)$ to arbitrary inversely polynomial $\epsilon$ is equivalent to "boosting"
correlation from inversely polynomial to $1 - \epsilon/\sigma_f^2$. Appendix C also shows
$\rho_{Ry} = \rho_{Rf}\rho_{fy}$. Thus $\rho_{Ry}$, the correlation coefficient reported in so many
statistical studies, also becomes arbitrarily close to $\rho_{fy}$, the optimal correlation
coefficient.
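The relation between $\rho_{Rf}$ and $\epsilon(R)$ can be verified numerically for any partition whose leaf values are exact conditional means. The sketch below is illustrative only; the function name `correlation_and_error` and the encoding of the partition as a list of leaf ids are hypothetical.

```python
import math

def correlation_and_error(points, leaf_of):
    """Return (rho_Rf, error of R, variance of f) for a partition predictor R.

    points: list of (probability, f_value); leaf_of: leaf id for each point.
    R predicts the exact mean of f on each leaf. Assumes f and R are not
    constant, so both variances are positive.
    """
    weight, total = {}, {}
    for (p, f), leaf in zip(points, leaf_of):
        weight[leaf] = weight.get(leaf, 0.0) + p
        total[leaf] = total.get(leaf, 0.0) + p * f
    q = {leaf: total[leaf] / weight[leaf] for leaf in weight}
    probs = [p for p, _ in points]
    fs = [f for _, f in points]
    rs = [q[leaf] for leaf in leaf_of]

    def mean(values):
        return sum(p * v for p, v in zip(probs, values))

    mean_f, mean_r = mean(fs), mean(rs)
    cov = mean([a * b for a, b in zip(fs, rs)]) - mean_f * mean_r
    var_f = mean([a * a for a in fs]) - mean_f ** 2
    var_r = mean([b * b for b in rs]) - mean_r ** 2
    error = mean([(b - a) ** 2 for a, b in zip(fs, rs)])
    return cov / math.sqrt(var_f * var_r), error, var_f
```

On four equally likely points with $f$ values 0.1, 0.3, 0.6, 0.9 split into two leaves of two points each, the correlation is about 0.907, which equals $\sqrt{1 - \epsilon(R)/\sigma_f^2}$ and exceeds $1 - \epsilon(R)/\sigma_f^2 \approx 0.823$.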



Theorem 2. Given a $\rho$ weak correlator, with probability $1 - \delta$, the learned regression
graph $R$ has $\epsilon(R) \le \epsilon$, with runtime polynomial in $1/\epsilon$, $1/\delta$, and $1/\rho(\epsilon/2)$.

Proof (sketch). The proof follows that of Lemma 5. There are three differences.

First, we must have a maximum time restriction on our weak correlators. If
a leaf has tiny $\text{var}_{\mathcal{D}_\ell}(f)$, then the weak correlator will have to run for a very long
time, e.g. if in one leaf there are only two types of $x$, one with $f(x) = 0.5$ and
the other with $f(x) = 0.49999$, then it could easily take the weak correlator a
long time to correlate with them. However, as seen in the proof of Lemma 5, we
can safely ignore all leaves with $\text{var}_{\mathcal{D}_\ell}(f) < \epsilon/2$. Since we can't identify them,
we simply stop each one after a certain amount of time running, for if we've
gone longer than $T$ time (which depends on the runtime guarantees of the weak
correlator, but is polynomial in $1/\epsilon$ and size($f$)), then we know that leaf has low
variance anyway.

Second, we estimate weights and values for each different leaf with fresh 
samples. This makes the analysis simple. 

Third, $h_{ab}$ is not necessarily a boolean attribute. Fortunately, there is some
threshold $\theta$ so that $I(h_{ab}(x) > \theta)$ also has large correlation. The arguments of
Lemma 5 show there exists an $h_{ab}$ with $\rho_{h_{ab}f} \ge \rho(\epsilon/2)$ and $\sigma_f^2 \ge \epsilon/2$, which
are inversely polynomial in $1/\epsilon$ and size($f$) by definition of weak correlator. Lemma 6 in
Appendix B implies that there will be some such threshold indicator $h$ with

$$\rho_{hf} \ge \frac{\rho_{h_{ab}f}}{2 + 2\sqrt{2}\log(2/(\rho_{h_{ab}f}\sigma_f))},$$

where quantities are measured over $\mathcal{D}_{\ell_{ab}}$. This is nearly $\rho_{h_{ab}f}/2$ and its reciprocal
is certainly polynomial in $1/\epsilon$ and size($f$). □



7 Conclusions 

While generalized additive models have been studied extensively in statistics, we 
have proven the first efficient learning guarantee, namely that regression graphs 
efficiently learn a generalized additive model (with a monotonic link function) 
to within arbitrary accuracy. 

In the case of classification boosting, most boosting algorithms are parametric
and maintain a linear combination of weak hypotheses. In fact, if a function is
boostable, then it is writable as a linear threshold of weak hypotheses (just
imagine running AdaBoost sufficiently long). We have shown that the class of
boostable functions in the real-valued setting is much richer. It includes at least
the mono-linear functions of base hypotheses.

It would be especially nice to remove the dependence on the Lipschitz constant.
(The bounded variation condition does not seem too restrictive.) For the
related problem of learning a linear threshold function with uniform classification
noise, Blum et al. [4] were able to remove the dependence on a margin that
was in Bylander's original work [5].

It would be nice to relax the assumption that $f(x) = E[y \mid x]$ is exactly
distributed according to a mono-additive function. While it seems difficult to
provably get as far as one can get in linear regression, i.e. find the best fit linear
predictor, it may be possible to do something in between. For any given
distribution there are often several mono-additive functions $f(x)$ that are
calibrated with the distribution, i.e. $f(x) = E[y \mid f(x)]$. For example, the
historical probability of white winning in a game of chess is almost certainly
monotonic in the quantity $w_1 \cdot x$ = (#white pieces) - (#black pieces). But it
should also be monotonic in terms of something like
$w_2 \cdot x$ = (#white pawns + ... + 3#white bishops) - (#black pawns + ... + 3#black bishops).
Can one do as well as the best calibrated mono-additive function without
assumptions on $\mathcal{D}$?
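A Proof of Lemma 1

Using the facts that $w_\ell = w_{\ell_0} + w_{\ell_1}$ and
$q_\ell = (w_{\ell_0} q_{\ell_0} + w_{\ell_1} q_{\ell_1})/w_\ell$, the change in $G$ is,

$$\Delta G = w_{\ell_0} q_{\ell_0}^2 + w_{\ell_1} q_{\ell_1}^2 - w_\ell \left(\frac{w_{\ell_0} q_{\ell_0} + w_{\ell_1} q_{\ell_1}}{w_\ell}\right)^2$$

$$= \frac{(w_{\ell_0} + w_{\ell_1})(w_{\ell_0} q_{\ell_0}^2 + w_{\ell_1} q_{\ell_1}^2) - (w_{\ell_0} q_{\ell_0} + w_{\ell_1} q_{\ell_1})^2}{w_\ell}$$

$$= \frac{w_{\ell_0} w_{\ell_1} (q_{\ell_0} - q_{\ell_1})^2}{w_\ell}.$$

Next,

$$\mathrm{cov}_{\mathcal{D}_\ell}(f, h) = E_{\mathcal{D}_\ell}[f(x)h(x)] - E_{\mathcal{D}_\ell}[f(x)]\,E_{\mathcal{D}_\ell}[h(x)]$$

$$= \frac{w_{\ell_1} q_{\ell_1}}{w_\ell} - \frac{w_{\ell_0} q_{\ell_0} + w_{\ell_1} q_{\ell_1}}{w_\ell} \cdot \frac{w_{\ell_1}}{w_\ell}$$

$$= \frac{(w_{\ell_0} + w_{\ell_1}) w_{\ell_1} q_{\ell_1} - (w_{\ell_0} q_{\ell_0} + w_{\ell_1} q_{\ell_1}) w_{\ell_1}}{w_\ell^2}$$

$$= \frac{w_{\ell_0} w_{\ell_1} (q_{\ell_1} - q_{\ell_0})}{w_\ell^2}.$$

Meanwhile, since $h$ is boolean,

$$\mathrm{var}_{\mathcal{D}_\ell}(h) = P_{\mathcal{D}_\ell}[h(x) = 0]\,P_{\mathcal{D}_\ell}[h(x) = 1] = \frac{w_{\ell_0}}{w_\ell} \cdot \frac{w_{\ell_1}}{w_\ell} = \frac{w_{\ell_0} w_{\ell_1}}{w_\ell^2}.$$

Finally, $\Delta G = w_\ell\, \mathrm{cov}_{\mathcal{D}_\ell}(f, h)^2 / \mathrm{var}_{\mathcal{D}_\ell}(h) = w_\ell\, \mathrm{cor}_{\mathcal{D}_\ell}(f, h)^2\, \mathrm{var}_{\mathcal{D}_\ell}(f)$.

□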
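The identity above can be checked numerically. The following is a minimal sketch, not part of the original proof: the function names and the example weights and values are made up for illustration, and the arguments w0, q0, w1, q1 stand for the weights and values of the two children of a split leaf.

```python
def split_gain(w0, q0, w1, q1):
    """Change in G from the definition: children terms minus the parent term."""
    w = w0 + w1
    q = (w0 * q0 + w1 * q1) / w  # value of the unsplit leaf
    return w0 * q0 ** 2 + w1 * q1 ** 2 - w * q ** 2


def split_gain_closed_form(w0, q0, w1, q1):
    """Simplified form: w0 * w1 * (q0 - q1)^2 / w."""
    return w0 * w1 * (q0 - q1) ** 2 / (w0 + w1)


def split_gain_via_covariance(w0, q0, w1, q1):
    """Same quantity written as w * cov(f, h)^2 / var(h) for a boolean h."""
    w = w0 + w1
    cov = w0 * w1 * (q1 - q0) / w ** 2
    var_h = w0 * w1 / w ** 2
    return w * cov ** 2 / var_h


# Hypothetical leaf: children with weights 0.2 and 0.3 and values 0.3 and 0.8.
example = (0.2, 0.3, 0.3, 0.8)
gains = (split_gain(*example),
         split_gain_closed_form(*example),
         split_gain_via_covariance(*example))
```

All three expressions agree on this example, which is what the chain of equalities in the proof asserts.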
B Thresholds

Lemma 6. Let $u \in [0, 1]$ be a random variable and $v \in \mathbb{R}$ be a positively
correlated random variable. Then there exists some threshold $t \in \mathbb{R}$ such that the
indicator random variable $v_t = I(v > t)$ has correlation near $\rho_{uv} > 0$,

$$\rho_{uv_t} \ge \frac{\rho_{uv}}{2 + 2\sqrt{2 \log(2/(\rho_{uv}\sigma_u))}}.$$



Proof. WLOG let $v$ be a standard random variable, i.e. $\mu_v = 0$ and $\sigma_v = 1$. The
main idea is to argue, for $\tau = 2/\sigma_{uv}$, that



^uv+dt > 



2 + 2v^21og(l/r) J^r 



(jy^dt. 



This implies that there exists a $t \in [-\tau, \tau]$ for which the above holds
instantaneously, i.e.,






2 + 2v/2bi(IM ‘ 

^UVt ^ ^UV 

o'uO’vt ct„( 2 -|- 2-\/2log(T7^) 

The above is equivalent to the lemma for $\tau = 2/\sigma_{uv} = 2/(\rho_{uv}\sigma_u)$. Thus it suffices
to show (2).

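First, a simple translation can bring $\mu_u = 0$. This will not change any
correlation, so WLOG let us assume that $\mu_u = 0$ and that $u \in (-1, 1)$. This makes
calculations easier because now $\sigma_{uv_t} = E[uv_t] - \mu_u \mu_{v_t} = E[uv_t]$ for all $t$. Define
the random variable $w$ by,

$$w = \int_{-\tau}^{\tau} v_t\,dt - \tau = \begin{cases} \tau & \text{if } v \ge \tau \\ v & \text{if } v \in (-\tau, \tau) \\ -\tau & \text{if } v \le -\tau \end{cases}$$

Then we have, by linearity of expectation,

$$\int_{-\tau}^{\tau} \sigma_{uv_t}\,dt = \int_{-\tau}^{\tau} E[uv_t]\,dt = E\left[u \int_{-\tau}^{\tau} v_t\,dt\right] = E[u(w + \tau)] = E[uw].$$

Next, notice that $|v - w| \le |v|$ and, if $v - w \ne 0$ then $|v| \ge \tau$. This means that
$|v - w| \le v^2/\tau$. Consequently, $E[u(v - w)] \le E[|v - w|] \le E[v^2/\tau] = 1/\tau$, so,

$$\int_{-\tau}^{\tau} \sigma_{uv_t}\,dt = E[uw] = E[uv] - E[u(v - w)] \ge E[uv] - \frac{1}{\tau} = \frac{\sigma_{uv}}{2}. \qquad (3)$$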

For the second part, by the Cauchy-Schwarz inequality,

J ay^dt = J ay^Vi . < J j ay^tdt J -dt. 
Now, $\sigma_{v_t}^2 = E[v_t^2] - E[v_t]^2 = P[v > t]\,P[v \le t]$. The above is at most:

]j P\v>t]tdt-log{l/T) = \j P[v> y/y\]^dy\og{l/T). 
For a nonnegative random variable $A$, $E[A] = \int_0^\infty P[A \ge y]\,dy$. Thus

/ r^ poo 

P[v > < J P[v'^ > y]dy = E[v'^] = 1. 

By symmetry, we get 

/ ay^dt< [ <1+ \/21og(l/r). 



Equations (4) and (3) imply (2), and we are done. 



C Facts About Regression Graphs 

It is easy to see that jiy = yf = yn = wiqe. Also, 

e(i?) = E[f{xf] + E[R{xf] - 2E[f{x)R{x)] 
= E[f{xY]-E[R{xf]=a)-al 

= - XI 

t 

= a} + y} 

e 



Since ^ = 1, we have ^ aw£ + bw^qi is constant across graphs. So ^ awi + 

bwiqf, — cwiqf for c > 0 as an objective function is equivalent to using — ^ wiqj. 
Finally, cov{R, f) = a = J2weqf ~ Pr = ~ y), implying that prf = 

anlaf = ~ p)/cTf = Jl ~ e{R)/aj 
Abstract. We study two boosting algorithms, Coordinate Ascent Boosting
and Approximate Coordinate Ascent Boosting, which are explicitly
designed to produce maximum margins. To derive these algorithms, we 
introduce a smooth approximation of the margin that one can maximize 
in order to produce a maximum margin classifier. Our first algorithm 
is simply coordinate ascent on this function, involving a line search at 
each step. We then make a simple approximation of this line search to 
reveal our second algorithm. These algorithms are proven to asymptot- 
ically achieve maximum margins, and we provide two convergence rate 
calculations. The second calculation yields a faster rate of convergence 
than the first, although the first gives a more explicit (still fast) rate. 
These algorithms are very similar to AdaBoost in that they are based on 
coordinate ascent, easy to implement, and empirically tend to converge 
faster than other boosting algorithms. Finally, we attempt to understand 
AdaBoost in terms of our smooth margin, focusing on cases where Ad- 
aBoost exhibits cyclic behavior. 



1 Introduction 

Boosting is currently a popular and successful technique for classification. The 
first practical boosting algorithm was AdaBoost, developed by Freund and Scha- 
pire [4] . The goal of boosting is to construct a “strong” classifier using only a 
training set and a “weak” learning algorithm. A weak learning algorithm pro- 
duces “weak” classifiers, which are only required to classify somewhat better 
than a random guess. For an introduction, see the review paper of Schapire [13]. 

In practice, AdaBoost often tends not to overfit (only slightly in the limit [5]),
and performs remarkably well on test data. The leading explanation for
AdaBoost's ability to generalize is the margin theory. According to this theory, the
margin can be viewed as a confidence measure of a classifier's predictive
ability. This theory is based on (loose) generalization bounds, e.g., the bounds of
Schapire et al. [14] and Koltchinskii and Panchenko [6]. Although the empirical
success of a boosting algorithm depends on many factors (e.g., the type of data
and how noisy it is, the capacity of the weak learning algorithm, the number of
boosting iterations before stopping, other means of regularization, entire margin
distribution), the margin theory does provide a reasonable qualitative
explanation (though not a complete explanation) of AdaBoost's success, both
empirically and theoretically. However, AdaBoost has not been shown to achieve the
largest possible margin. In fact, the opposite has been recently proved, namely
that AdaBoost may converge to a solution with margin significantly below the
maximum value [11]. This was proved for specific cases where AdaBoost exhibits
cyclic behavior; such behavior is common when there are very few "support
vectors".

* This research was partially supported by NSF Grants IIS-0325500, DMS-9810783,
and ANI-0085984.

Since AdaBoost’s performance is not well understood, a number of other 
boosting algorithms have emerged that directly aim to maximize the margin. 
Many of these algorithms are not as easy to implement as AdaBoost, or
require a significant amount of calculation at each step, e.g., the solution of a
linear program (LP-AdaBoost [5]), an optimization over a non-convex function
(DOOM [7]) or a huge number of very small steps ($\varepsilon$-boosting, where
convergence to a maximum margin solution has not been proven, even as the step
size vanishes [10]). These extra calculations may slow down the convergence
rate dramatically. Thus, we compare our new algorithms with arc-gv [2] and
AdaBoost* [9]; these algorithms are as simple to program as AdaBoost and have 
convergence guarantees with respect to the margin. Our new algorithms are more 
aggressive than both arc-gv and AdaBoost*, providing an explanation for their 
empirically faster convergence rate. 

In terms of theoretical rate guarantees, our new algorithms converge to a
maximum margin solution with a polynomial convergence rate. Namely, within
poly$(1/\varepsilon)$ iterations, they produce a classifier whose margin is within $\varepsilon$ of the
maximum possible margin. Arc-gv is proven to converge to a maximum margin
solution asymptotically [2,8], but we are not aware of any proven convergence
rate. AdaBoost* [9] converges to a solution within $\varepsilon$ of the maximum margin in
$2(\log_2 m)/\varepsilon^2$ steps (where the user specifies a fixed value of $\varepsilon$); there is a tradeoff
between user-determined accuracy and convergence rate for this algorithm. In 
practice, AdaBoost* converges very slowly since it is not aggressive; it takes 
small steps (though it has the nice convergence rate guarantee stated above). In 
fact, if the weak learner always finds a weak classifier with a large edge (i.e., if 
the weak learning algorithm performs well on the weighted training data), the 
convergence of AdaBoost* can be especially slow. 

The two new boosting algorithms we introduce (which are presented in [12] 
without analysis) are based on coordinate ascent. For AdaBoost, the fact that it 
is a minimization algorithm based on coordinate descent does not imply conver- 
gence to a maximum margin solution. For our new algorithms, we can directly 
use the fact that they are coordinate ascent algorithms to help show convergence 
to a maximum margin solution, since they make progress towards increasing a 
differentiable approximation of the margin (a “smooth margin function”) at ev- 
ery iteration. 
To summarize, the advantages of our new algorithms, Coordinate Ascent
Boosting and Approximate Coordinate Ascent Boosting, are as follows:

— They empirically tend to converge faster than both arc-gv and AdaBoost*. 

— They provably converge to a maximum margin solution asymptotically. This 
convergence is robust, in that we do not require the weak learning algorithm 
to produce the best possible classifier at every iteration; only a sufficiently 
good classifier is required. 

— They have convergence rate guarantees that are polynomial in $1/\varepsilon$.

— They are as easy to implement as AdaBoost, arc-gv, and AdaBoost*. 

— These algorithms have theoretical and intuitive justification: they make pro- 
gress with respect to a smooth version of the margin, and operate via coor- 
dinate ascent. 

Finally, we use our smooth margin function to analyze AdaBoost. Since Ad- 
aBoost’s good generalization properties are not completely explained by the 
margin theory, and still remain somewhat mysterious, we study properties of 
AdaBoost via our smooth margin function, focusing on cases where cyclic behav- 
ior occurs. “Cyclic behavior for AdaBoost” means the weak learning algorithm 
repeatedly chooses the same sequence of weak classifiers, and the weight vectors 
repeat with a given period. This has been proven to occur in special cases, and 
occurs often in low dimensions (i.e., when there are few “support vectors”) [11]. 

Our results concerning AdaBoost and our smooth margin are as follows: first, 
the value of the smooth margin increases if and only if AdaBoost takes a large 
enough step. Second, the value of the smooth margin must decrease for at least 
one iteration of a cycle unless all edge values are identical. Third, if all edges in 
a cycle are identical, then support vectors are misclassified by the same number 
of weak classifiers during the cycle. 

Here is the outline: in Section 2, we introduce our notation and the AdaBoost 
algorithm. In Section 3, we describe the smooth margin function that our algo- 
rithms are based on. In Section 4, we describe Coordinate Ascent Boosting (Al- 
gorithm 1) and Approximate Coordinate Ascent Boosting (Algorithm 2), and in 
Section 5, the convergence of these algorithms is discussed. Experimental trials 
on artificial data are presented in Section 6 to illustrate the comparison with 
other algorithms. In Section 7, we show connections between AdaBoost and our 
smooth margin function. 



2 Notation and Introduction to AdaBoost 

The training set consists of examples with labels $\{(x_i, y_i)\}_{i=1,\ldots,m}$, where
$(x_i, y_i) \in \mathcal{X} \times \{-1, 1\}$. The space $\mathcal{X}$ never appears explicitly in our calculations. Let
$\mathcal{H} = \{h_1, \ldots, h_n\}$ be the set of all possible weak classifiers that can be produced
by the weak learning algorithm, where $h_j : \mathcal{X} \to \{1, -1\}$. We assume that if
$h_j$ appears in $\mathcal{H}$, then $-h_j$ also appears in $\mathcal{H}$ (i.e., $\mathcal{H}$ is symmetric). Since our
classifiers are binary, and since we restrict our attention to their behavior on
a finite training set, we can assume that $n$ is finite. We think of $n$ as being
large, $m \ll n$, so a gradient descent calculation over an $n$-dimensional space is
impractical; hence AdaBoost uses coordinate descent instead, where only one
weak classifier is chosen at each iteration.

We define an $m \times n$ matrix $M$ where $M_{ij} = y_i h_j(x_i)$, i.e., $M_{ij} = +1$ if training
example $i$ is classified correctly by weak classifier $h_j$, and $-1$ otherwise. We
assume that no column of $M$ has all $+1$'s, that is, no weak classifier can classify
all the training examples correctly. (Otherwise the learning problem is trivial.)
Although M is too large to be explicitly constructed in practice, mathematically, 
it acts as the only “input” to AdaBoost, containing all the necessary information 
about the weak learner and training examples. 

AdaBoost computes a set of coefficients over the weak classifiers. The
(unnormalized) coefficient vector at iteration $t$ is denoted $\lambda_t$. Since the algorithms we
describe all have positive increments, we take $\lambda \in \mathbb{R}^n_+$. We define a seminorm by
$|||\lambda||| := \min_{\lambda'} \{\|\lambda'\|_1 \text{ such that } \forall j : \lambda_j - \lambda_{\tilde{j}} = \lambda'_j - \lambda'_{\tilde{j}}\}$, where $\tilde{j}$ is the index for
$-h_j$, and define $s(\lambda) := \sum_{j=1}^n \lambda_j$, noting $s(\lambda) \ge |||\lambda|||$. For the (non-negative)
vectors $\lambda_t$ generated by AdaBoost, we will denote $s_t := s(\lambda_t)$. The final
combined classifier that AdaBoost outputs is $f_{Ada} = \sum_{j=1}^n (\lambda_{t_{max},j} / |||\lambda_{t_{max}}|||)\, h_j$.
The margin of training example $i$ is defined to be $y_i f_{Ada}(x_i)$, or equivalently,
$(M\lambda)_i / |||\lambda|||$.

A boosting algorithm maintains a distribution, or set of weights, over the
training examples that is updated at each iteration, which is denoted $d_t \in \Delta_m$,
and $d_t^T$ is its transpose. Here, $\Delta_m$ denotes the simplex of $m$-dimensional vectors
with non-negative entries that sum to 1. At each iteration $t$, a weak classifier
$h_{j_t}$ is selected by the weak learning algorithm. The probability of error of $h_{j_t}$ at
time $t$ on the weighted training examples is $d_- := \sum_{\{i : M_{ij_t} = -1\}} d_{t,i}$. Also, denote
$d_+ := 1 - d_-$, and define $I_+ := \{i : M_{ij_t} = +1\}$ and $I_- := \{i : M_{ij_t} = -1\}$. Note
that $d_+$, $d_-$, $I_+$, and $I_-$ depend on $t$; the iteration number will be clear from the
context. The edge of weak classifier $j_t$ at time $t$ is $r_t := (d_t^T M)_{j_t}$, which can be
written as $r_t = (d_t^T M)_{j_t} = \sum_{i \in I_+} d_{t,i} - \sum_{i \in I_-} d_{t,i} = d_+ - d_- = 1 - 2d_-$. Thus,
a smaller edge indicates a higher probability of error. Note that $d_+ = (1 + r_t)/2$
and $d_- = (1 - r_t)/2$. Also define $\gamma_t := \tanh^{-1} r_t$.
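As a small illustration of these definitions, the sketch below computes the weights, the edge, and $\gamma$ for a toy matrix. The 3-by-3 matrix, the function names `weights` and `edge`, and the choice of column 0 are assumptions made for this example only; the toy matrix also ignores the symmetry assumption on the set of weak classifiers.

```python
import math


def weights(M, lam):
    """Distribution d with d_i proportional to exp(-(M lam)_i), summing to 1."""
    scores = [sum(m_ij * l_j for m_ij, l_j in zip(row, lam)) for row in M]
    unnorm = [math.exp(-s) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def edge(M, d, j):
    """Edge r = (d^T M)_j of weak classifier j under distribution d."""
    return sum(d_i * row[j] for d_i, row in zip(d, M))


# Toy matrix (made up): 3 examples, 3 weak classifiers,
# each classifier misclassifies exactly one example.
M = [[-1, 1, 1],
     [1, -1, 1],
     [1, 1, -1]]

d = weights(M, [0.0, 0.0, 0.0])  # uniform weights when lambda = 0
r = edge(M, d, 0)
d_minus = sum(d_i for d_i, row in zip(d, M) if row[0] == -1)
d_plus = 1.0 - d_minus
gamma = math.atanh(r)
```

Here the weighted error of classifier 0 is 1/3, so its edge is 1 - 2/3 = 1/3, matching the relation between the edge and the probability of error stated above.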

We wish our learning algorithms to have robust convergence, so we will not
require the weak learning algorithm to produce the weak classifier with the
largest possible edge value at each iteration. Rather, we only require a weak
classifier whose edge exceeds $\rho$, where $\rho$ is the largest possible margin that can
be attained for $M$, i.e., we use the "non-optimal" case for our analysis. AdaBoost
in the "optimal case" means $j_t \in \arg\max_j (d_t^T M)_j$, and AdaBoost in the
"non-optimal" case means $j_t \in \{j : (d_t^T M)_j \ge \rho\}$.

To achieve the best indication of a small probability of error (for margin-based
bounds), our goal is to find a $\tilde{\lambda} \in \Delta_n$ that maximizes the minimum margin over
training examples, $\min_i (M\tilde{\lambda})_i$ (or equivalently $\min_i y_i f_{Ada}(x_i)$), i.e., we wish
to find a vector $\tilde{\lambda} \in \arg\max_{\bar{\lambda} \in \Delta_n} \min_i (M\bar{\lambda})_i = \arg\max_{\lambda \in \mathbb{R}^n_+} \min_i (M\lambda)_i / |||\lambda|||$.
We call the minimum margin over training examples (i.e., $\min_i (M\lambda)_i / |||\lambda|||$)
the margin of classifier $\lambda$, denoted $\mu(\lambda)$. Any training example that achieves
this minimum margin is a support vector. Due to the von Neumann Min-Max
Theorem, $\min_{d \in \Delta_m} \max_j (d^T M)_j = \max_{\bar{\lambda} \in \Delta_n} \min_i (M\bar{\lambda})_i$. We denote this value
by $\rho$.

Figure 1 shows pseudocode for AdaBoost. At each iteration, the distribution
$d_t$ is updated and renormalized (Step 3a), classifier $j_t$ with sufficiently large edge
is selected (Step 3b), and the weight of that classifier is updated (Step 3e).



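1. Input: Matrix $M$, No. of iterations $t_{max}$

2. Initialize: $\lambda_{1,j} = 0$ for $j = 1, \ldots, n$

3. Loop for $t = 1, \ldots, t_{max}$

   a) $d_{t,i} = e^{-(M\lambda_t)_i} / \sum_{\tilde{i}=1}^m e^{-(M\lambda_t)_{\tilde{i}}}$ for $i = 1, \ldots, m$

   b) $j_t \in \arg\max_j (d_t^T M)_j$ ("optimal" case)
      $j_t \in \{j : (d_t^T M)_j \ge \rho\}$ ("non-optimal" case)

   c) $r_t = (d_t^T M)_{j_t}$

   d) $\alpha_t = \frac{1}{2} \ln\left(\frac{1 + r_t}{1 - r_t}\right)$

   e) $\lambda_{t+1} = \lambda_t + \alpha_t e_{j_t}$, where $e_{j_t}$ is 1 in position $j_t$ and 0 elsewhere.

4. Output: $\lambda_{t_{max}} / |||\lambda_{t_{max}}|||$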

Fig. 1. Pseudocode for the AdaBoost algorithm. 
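The pseudocode of Figure 1 can be turned into a short runnable sketch. This is an illustrative version for the "optimal" case only, under stated assumptions: the toy matrix is made up, no column pair corresponds to a classifier and its negation (so the seminorm in Step 4 reduces to the plain sum of the coefficients), and `adaboost` is a name chosen here, not one from the paper.

```python
import math


def adaboost(M, t_max):
    """AdaBoost on the matrix M (optimal case), returning normalized coefficients."""
    m, n = len(M), len(M[0])
    lam = [0.0] * n                                   # Step 2
    for _ in range(t_max):                            # Step 3
        # Step 3a: weights proportional to exp(-(M lam)_i), renormalized.
        unnorm = [math.exp(-sum(M[i][j] * lam[j] for j in range(n)))
                  for i in range(m)]
        total = sum(unnorm)
        d = [u / total for u in unnorm]
        # Step 3b: pick the weak classifier with the largest edge.
        edges = [sum(d[i] * M[i][j] for i in range(m)) for j in range(n)]
        j_t = max(range(n), key=lambda j: edges[j])
        r_t = edges[j_t]                              # Step 3c
        alpha_t = 0.5 * math.log((1 + r_t) / (1 - r_t))  # Step 3d
        lam[j_t] += alpha_t                           # Step 3e
    s = sum(lam)                                      # Step 4 (assumes |||lam||| = s)
    return [l / s for l in lam]


# Toy matrix (made up): each weak classifier misclassifies one example.
M = [[-1, 1, 1],
     [1, -1, 1],
     [1, 1, -1]]
lam_bar = adaboost(M, 30)
margins = [sum(a * b for a, b in zip(row, lam_bar)) for row in M]
```

For this toy matrix the largest achievable minimum margin is 1/3 (attained by equal coefficients), so the minimum of `margins` can be compared against that value.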



AdaBoost is known to be a coordinate descent algorithm for minimizing
$F(\lambda) := \sum_{i=1}^m e^{-(M\lambda)_i}$ [1]. The proof (for the optimal case) is that the choice
of weak classifier $j_t$ is given by:
$j_t \in \arg\max_j \left[ -\,dF(\lambda_t + \alpha e_j)/d\alpha \,\big|_{\alpha=0} \right] = \arg\max_j (d_t^T M)_j$,
and the step size AdaBoost chooses at iteration $t$ is $\alpha_t$,
where $\alpha_t$ satisfies the equation for the line search along direction $j_t$:
$0 = -\,dF(\lambda_t + \alpha_t e_{j_t})/d\alpha_t$. Convergence in the non-separable case is fully
understood [3]. In the separable case ($\rho > 0$), the minimum value of $F$ is 0 and occurs
as $|||\lambda||| \to \infty$; this tells us nothing about the value of the margin, i.e., an
algorithm which simply minimizes $F$ can achieve an arbitrarily bad margin. So it
must be the process of coordinate descent which awards AdaBoost its ability to
increase margins, not simply AdaBoost's ability to minimize $F$.



3 The Smooth Margin Function $G(\lambda)$



We wish to consider a function that, unlike $F$, actually tells us about the value
of the margin. Our new function $G$ is defined for $\lambda \in \mathbb{R}^n_+$, $s(\lambda) > 1$ by:

$$G(\lambda) := \frac{-\ln F(\lambda)}{s(\lambda)} = \frac{-\ln\left(\sum_{i=1}^m e^{-(M\lambda)_i}\right)}{\sum_j \lambda_j}. \qquad (1)$$

One can think of $G$ as a smooth approximation of the margin, since it depends
on the entire margin distribution when $s(\lambda)$ is finite, and weights training
examples with small margins much more highly than examples with larger margins.
The function $G$ also bears a resemblance to the objective implicitly used for
$\varepsilon$-boosting [10]. Note that since $s(\lambda) \ge |||\lambda|||$, we have $G(\lambda) \le -(\ln F(\lambda))/|||\lambda|||$.
Lemma 1 (parts of which appear in [12]) shows that $G$ has many nice properties.
Lemma 1.

1. $G(\lambda)$ is a concave function (but not necessarily strictly concave) in each
"shell" where $s(\lambda)$ is fixed. In addition, $G(\lambda)$ becomes concave when $s(\lambda)$
becomes large.

2. $G(\lambda)$ becomes concave when $|||\lambda|||$ becomes large.

3. As $|||\lambda||| \to \infty$, $-(\ln F(\lambda))/|||\lambda||| \to \mu(\lambda)$.

4. The value of $G(\lambda)$ increases radially, i.e., $dG(\lambda(1 + a))/da \,\big|_{a=0} > 0$.

It follows from 3 and 4 that the maximum value of $G$ is the maximum value
of the margin, since for each $\lambda$, we may construct a $\lambda'$ such that $G(\lambda') =
-\ln F(\lambda)/|||\lambda|||$. We omit the proofs of 1 and 4. Note that if $|||\lambda|||$ is large, $s(\lambda)$
is large since $|||\lambda||| \le s(\lambda)$. Thus, 2 follows from 1.

Proof. (of property 3)

$$m\,e^{-\mu(\lambda)|||\lambda|||} = \sum_{i=1}^m e^{-\min_\ell (M\lambda)_\ell} \ge \sum_{i=1}^m e^{-(M\lambda)_i} > e^{-\min_i (M\lambda)_i} = e^{-\mu(\lambda)|||\lambda|||},$$

hence, $-(\ln m)/|||\lambda||| + \mu(\lambda) \le -(\ln F(\lambda))/|||\lambda||| < \mu(\lambda)$. (2)

□

The properties of $G$ shown in Lemma 1 outline the reasons why we choose to
maximize $G$ using coordinate ascent; namely, maximizing $G$ leads to a maximum
margin solution, and the region where $G$ is near its maximum value is concave.
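To make the relation between $F$, $G$, and the margin concrete, here is a minimal numerical sketch of inequality (2). It assumes a made-up toy matrix in which no weak classifier appears together with its negation, so that the seminorm equals the plain sum $s(\lambda)$; the function names and the coefficient vector are illustrative only.

```python
import math


def F(M, lam):
    """Exponential loss: sum over examples of exp(-(M lam)_i)."""
    return sum(math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M)


def smooth_margin(M, lam):
    """G(lam) = -ln F(lam) / s(lam), with s(lam) the sum of the coefficients."""
    return -math.log(F(M, lam)) / sum(lam)


def margin(M, lam):
    """mu(lam): minimum over examples of (M lam)_i / s(lam)."""
    return min(sum(a * b for a, b in zip(row, lam)) for row in M) / sum(lam)


M = [[-1, 1, 1],
     [1, -1, 1],
     [1, 1, -1]]
lam = [10.0, 8.0, 9.0]   # arbitrary coefficients chosen for illustration

G_val = smooth_margin(M, lam)
mu_val = margin(M, lam)
lower = mu_val - math.log(len(M)) / sum(lam)
```

On this example the smooth margin sits between `lower` and `mu_val`, and the gap `math.log(len(M)) / sum(lam)` shrinks as the coefficients are scaled up, which is the content of property 3.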



4 Derivation of Algorithms 

We now suggest two boosting algorithms (derived without analysis in [12]) that
aim to maximize the margin explicitly (like arc-gv and AdaBoost*) and are
based on coordinate ascent (like AdaBoost). Our new algorithms choose the
direction of ascent (value of $j_t$) using the same formula as AdaBoost, arc-gv,
and AdaBoost*, i.e., $j_t \in \arg\max_j (d_t^T M)_j$. Thus, our new algorithms require
exactly the same type of weak learning algorithm.

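To help with the analysis later, we will write recursive equations for $F$ and
$G$. The recursive equation for $F$ (derived only using the definition) is:

$$F(\lambda_t + \alpha e_{j_t}) = \frac{\cosh(\gamma_t - \alpha)}{\cosh \gamma_t}\, F(\lambda_t). \qquad (3)$$

By definition of $G$, we know $-\ln F(\lambda_t) = s_t G(\lambda_t)$ and $-\ln F(\lambda_t + \alpha e_{j_t}) =
(s_t + \alpha) G(\lambda_t + \alpha e_{j_t})$. From (3), we find a recursive equation for $G$:

$$(s_t + \alpha)\, G(\lambda_t + \alpha e_{j_t}) = -\ln F(\lambda_t) - \ln\left(\frac{\cosh(\gamma_t - \alpha)}{\cosh \gamma_t}\right) = s_t G(\lambda_t) + \int_{\gamma_t - \alpha}^{\gamma_t} \tanh u\, du. \qquad (4)$$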
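The recursion for $F$ can be checked numerically: stepping a single coordinate by $\alpha$ should multiply $F$ by $\cosh(\gamma - \alpha)/\cosh\gamma$, where $\gamma$ is the inverse hyperbolic tangent of that coordinate's edge. The sketch below is illustrative; the toy matrix, the starting coefficients, the chosen coordinate, and the function names are assumptions made for this example.

```python
import math


def F(M, lam):
    """Exponential loss: sum over examples of exp(-(M lam)_i)."""
    return sum(math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M)


def edge_at(M, lam, j):
    """Edge of column j under the weights induced by lam."""
    w = [math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M]
    return sum(w_i * row[j] for w_i, row in zip(w, M)) / sum(w)


def recursion_sides(M, lam, j, alpha):
    """Return (F after the step, cosh-ratio times F before the step)."""
    gamma = math.atanh(edge_at(M, lam, j))
    stepped = list(lam)
    stepped[j] += alpha
    lhs = F(M, stepped)
    rhs = math.cosh(gamma - alpha) / math.cosh(gamma) * F(M, lam)
    return lhs, rhs


M = [[-1, 1, 1],
     [1, -1, 1],
     [1, 1, -1]]
lhs, rhs = recursion_sides(M, [0.3, 0.1, 0.2], j=1, alpha=0.4)
```

The two sides agree up to floating-point error. Taking negative logarithms of both sides gives the first equality in the recursion for $G$.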
We shall look at two different algorithms; in the first, we assign to $\alpha_t$ the
value $\alpha$ that maximizes $G(\lambda_t + \alpha e_{j_t})$, which requires solving an implicit equation.
In the second algorithm, inspired by the first, we pick a value for $\alpha_t$ that can
be computed in a straightforward way, even though it is not a maximizer of
$G(\lambda_t + \alpha e_{j_t})$. In both cases, the algorithm starts by simply running AdaBoost
until $G(\lambda)$ becomes positive, which must happen (in the separable case) since:

Lemma 2. In the separable case (where $\rho > 0$), AdaBoost achieves a positive
value for $G(\lambda_t)$ in at most $\lceil -2 \ln F(\lambda_1) / \ln(1 - \rho^2) \rceil + 1$ iterations.
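A small sketch of this bound follows, under two assumptions that are not confirmed by the text as extracted: that the bound reads as the ceiling of $-2 \ln F(\lambda_1)/\ln(1-\rho^2)$ plus one, and that AdaBoost starts from $\lambda_1 = 0$, so $F(\lambda_1) = m$. The toy matrix (with maximum margin 1/3) and the function names are made up for illustration.

```python
import math


def iterations_until_positive(m, rho):
    """ceil(-2 ln m / ln(1 - rho^2)) + 1, assuming F(lambda_1) = m."""
    return math.ceil(-2.0 * math.log(m) / math.log(1.0 - rho ** 2)) + 1


def first_positive_iteration(M):
    """Run AdaBoost (optimal case) and return the first t with F(lambda_t) < 1."""
    m, n = len(M), len(M[0])
    lam = [0.0] * n
    t = 1
    while True:
        unnorm = [math.exp(-sum(M[i][j] * lam[j] for j in range(n)))
                  for i in range(m)]
        total = sum(unnorm)
        if total < 1.0:          # F < 1 is equivalent to G > 0
            return t
        edges = [sum(unnorm[i] * M[i][j] for i in range(m)) / total
                 for j in range(n)]
        j_t = max(range(n), key=lambda j: edges[j])
        lam[j_t] += math.atanh(edges[j_t])   # AdaBoost step size
        t += 1


M = [[-1, 1, 1],
     [1, -1, 1],
     [1, 1, -1]]
bound = iterations_until_positive(m=3, rho=1.0 / 3.0)
observed = first_positive_iteration(M)
```

The reasoning behind the bound is that each AdaBoost step multiplies $F$ by $\sqrt{1 - r_t^2}$, which is at most $\sqrt{1 - \rho^2}$ when every selected edge is at least $\rho$.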

The proof of Lemma 2 (which is omitted) uses (3). Denote $\lambda_1^{[1]}, \ldots, \lambda_t^{[1]}$ to be
a sequence of coefficient vectors generated by Algorithm 1, and $\lambda_1^{[2]}, \ldots, \lambda_t^{[2]}$ to
be generated by Algorithm 2. Similarly, we distinguish sequences $\alpha_t^{[1]}$ and $\alpha_t^{[2]}$,
$g_t^{[1]} := G(\lambda_t^{[1]})$, $g_t^{[2]} := G(\lambda_t^{[2]})$, $s_t^{[1]}$, and $s_t^{[2]}$. Sometimes we compare the behavior
of Algorithms 1 and 2 based on one iteration (from $t$ to $t + 1$) as if they had
started from the same coefficient vector at iteration $t$; we denote this vector by
$\lambda_t$. When both Algorithms 1 and 2 satisfy a set of equations, we will remove
the superscripts $[1]$ and $[2]$. Although sequences such as $j_t$, $r_t$, $\gamma_t$, and $d_t$ are also
different for Algorithms 1 and 2, we leave the notation without the superscript.



4.1 Algorithm 1: Coordinate Ascent Boosting 



Rather than considering coordinate descent on $F$ as in AdaBoost, let us consider
coordinate ascent on $G$. In what follows, we will use only positive values of $G$, as
we have justified above. The choice of direction $j_t$ at iteration $t$ (in the optimal
case) obeys: $j_t \in \arg\max_j \, dG(\lambda_t^{[1]} + \alpha e_j)/d\alpha \,\big|_{\alpha=0}$, that is,

$$j_t \in \arg\max_j \left[ \left( \frac{\sum_{i=1}^m e^{-(M\lambda_t^{[1]})_i} M_{ij}}{F(\lambda_t^{[1]})} \right) \frac{1}{s_t^{[1]}} + \frac{\ln(F(\lambda_t^{[1]}))}{(s_t^{[1]})^2} \right].$$

Of these two terms on the right, the second term does not depend on $j$, and
the first term is simply a constant times $(d_t^T M)_j$. Thus the same direction will
be chosen here as for AdaBoost. The "non-optimal" setting we define for this
algorithm will be the same as AdaBoost's, so Step 3b of this new algorithm will
be the same as AdaBoost's.

To determine the step size, ideally we would like to maximize $G(\lambda_t^{[1]} + \alpha e_{j_t})$
with respect to $\alpha$, i.e., we will define $\alpha_t^{[1]}$ to obey $dG(\lambda_t^{[1]} + \alpha e_{j_t})/d\alpha = 0$ for
$\alpha = \alpha_t^{[1]}$. Differentiating (4) with respect to $\alpha$ (while incorporating
$dG(\lambda_t^{[1]} + \alpha e_{j_t})/d\alpha = 0$) gives the following condition for $\alpha_t^{[1]}$:

$$G(\lambda_{t+1}^{[1]}) = G(\lambda_t^{[1]} + \alpha_t^{[1]} e_{j_t}) = \tanh(\gamma_t - \alpha_t^{[1]}). \qquad (5)$$

There is not a nice analytical solution for $\alpha_t^{[1]}$, but maximization of
$G(\lambda_t^{[1]} + \alpha e_{j_t})$ is 1-dimensional so it can be performed quickly. Hence we have defined
the first of our new boosting algorithms: coordinate ascent on $G$, implementing
a line search at each iteration. To clarify the line search step at iteration $t$ using
(5) and (4), we use $G(\lambda_t^{[1]})$, $\gamma_t$, and $s_t^{[1]}$ to solve for $\alpha_t^{[1]}$ that satisfies:

$$s_t^{[1]} G(\lambda_t^{[1]}) + \ln\left( \frac{\cosh \gamma_t}{\cosh(\gamma_t - \alpha_t^{[1]})} \right) = (s_t^{[1]} + \alpha_t^{[1]}) \tanh(\gamma_t - \alpha_t^{[1]}). \qquad (6)$$



Summarizing, we define Algorithm 1 as follows:

— First, use AdaBoost (Figure 1) until $G(\lambda_t^{[1]})$ defined by (1) is positive. At this
point, replace Step 3d of AdaBoost as prescribed: $\alpha_t^{[1]}$ equals the (unique)
solution of (6). Proceed, using this modified iterative procedure.

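Let us rearrange the equation slightly. Using the notation $g_{t+1}^{[1]} := G(\lambda_{t+1}^{[1]})$
in (5), we find that $\alpha_t^{[1]}$ satisfies the following (implicitly):

$$\alpha_t^{[1]} = \gamma_t - \tanh^{-1}(g_{t+1}^{[1]}) = \tanh^{-1} r_t - \tanh^{-1}(g_{t+1}^{[1]}) = \frac{1}{2} \ln\left( \frac{1 + r_t}{1 - r_t} \cdot \frac{1 - g_{t+1}^{[1]}}{1 + g_{t+1}^{[1]}} \right). \qquad (7)$$

For any $\lambda \in \mathbb{R}^n_+$, from (2) and since $|||\lambda||| \le s(\lambda)$, we have $G(\lambda) < \rho$.
Consequently, $g_{t+1}^{[1]} < \rho \le r_t$, so $\alpha_t^{[1]}$ is strictly positive. On the other hand, since
$G(\lambda_{t+1}^{[1]}) \ge G(\lambda_t^{[1]})$, we again have $G(\lambda_{t+1}^{[1]}) > 0$, and thus $\alpha_t^{[1]} < \gamma_t$.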
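The line search for Algorithm 1 can be sketched with plain bisection. This is one possible way to solve the implicit equation for the step size, not necessarily the authors' implementation; the function name `line_search_step` and the example values of $s_t$, $g_t$, and $r_t$ are assumptions. Bisection applies because the difference between the two sides of the equation is increasing in $\alpha$, negative at $\alpha = 0$ when $g_t < r_t$, and positive at $\alpha = \gamma_t$ when $g_t > 0$.

```python
import math


def line_search_step(s_t, g_t, r_t, iters=100):
    """Solve s*g + ln(cosh(gamma)/cosh(gamma - a)) = (s + a) * tanh(gamma - a) for a."""
    gamma = math.atanh(r_t)

    def h(alpha):
        # Left side minus right side of the implicit equation.
        return (s_t * g_t
                + math.log(math.cosh(gamma) / math.cosh(gamma - alpha))
                - (s_t + alpha) * math.tanh(gamma - alpha))

    lo, hi = 0.0, gamma          # h(lo) < 0 < h(hi) when 0 < g_t < r_t
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical state: s_t = 5, g_t = 0.1, edge r_t = 0.5.
s_t, g_t, r_t = 5.0, 0.1, 0.5
alpha = line_search_step(s_t, g_t, r_t)
gamma = math.atanh(r_t)
g_next = math.tanh(gamma - alpha)   # new smooth margin at the maximizer
```

At the returned step the new value of the smooth margin equals $\tanh(\gamma_t - \alpha)$, and it exceeds $g_t$, consistent with a coordinate ascent step.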



4.2 Algorithm 2: Approximate Coordinate Ascent Boosting 

The second of our two new boosting algorithms avoids the line search of
Algorithm 1, and is even slightly more aggressive. It performs very similarly to
Algorithm 1 in our experiments. To define this algorithm, we consider the
following approximate solution to the maximization problem (5):

$$G(\lambda_t^{[2]}) = \tanh(\gamma_t - \alpha_t^{[2]}), \text{ or more explicitly,} \qquad (8)$$

$$\alpha_t^{[2]} = \gamma_t - \tanh^{-1}(g_t^{[2]}) = \tanh^{-1} r_t - \tanh^{-1}(g_t^{[2]}) = \frac{1}{2} \ln\left( \frac{1 + r_t}{1 - r_t} \cdot \frac{1 - g_t^{[2]}}{1 + g_t^{[2]}} \right). \qquad (9)$$

This update still yields an increase in $G$. (This can be shown using (4) and
the monotonicity of tanh.) Summarizing, we define Algorithm 2 as the iterative
procedure of AdaBoost (Figure 1) with one change:

— Replace Step 3d of AdaBoost as follows:

$$\alpha_t^{[2]} = \frac{1}{2} \ln\left( \frac{1 + r_t}{1 - r_t} \cdot \frac{1 - g_t^{[2]}}{1 + g_t^{[2]}} \right), \qquad g_t^{[2]} := \max\{0, G(\lambda_t^{[2]})\},$$

where $G$ is defined in (1). (Note that we could also have written the procedure
in the same way as for Algorithm 1. As long as $G(\lambda_t^{[2]}) \le 0$, this update is the
same as in AdaBoost.)

Algorithm 2 is slightly more aggressive than Algorithm 1, in the sense that
it picks a larger relative step size $\alpha_t$, albeit not as large as the step size defined
by AdaBoost itself. If Algorithm 1 and Algorithm 2 were started at the same
position $\lambda_t$, with $g_t := G(\lambda_t)$, then Algorithm 2 would always take a slightly
larger step than Algorithm 1; since $g_{t+1}^{[1]} \ge g_t$, we can see from (7) and (9) that
$\alpha_t^{[1]} \le \alpha_t^{[2]}$.

As a remark, if we use the updates of Algorithms 1 or 2 from the
start, they would also reach a positive margin quickly. In fact, after at most
$\lceil 2 \ln F(\lambda_1) / [-\ln(1 - \rho^2) + \ln(1 - G(\lambda_1))] \rceil + 1$ iterations, $G(\lambda_t)$ would have a
positive value.
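A compact sketch of Algorithm 2 follows, written as AdaBoost with the modified step size. It covers the "optimal" case only and rests on assumptions: the toy matrix is made up (its maximum margin is 1/3), the function name is chosen here, and the step is written as the difference of two inverse hyperbolic tangents, which equals the logarithmic form of the update.

```python
import math


def approx_coordinate_ascent(M, t_max):
    """Run Algorithm 2 (optimal case); return coefficients and the g_t history."""
    m, n = len(M), len(M[0])
    lam = [0.0] * n
    history = []
    for _ in range(t_max):
        unnorm = [math.exp(-sum(M[i][j] * lam[j] for j in range(n)))
                  for i in range(m)]
        F_val = sum(unnorm)
        d = [u / F_val for u in unnorm]
        edges = [sum(d[i] * M[i][j] for i in range(m)) for j in range(n)]
        j_t = max(range(n), key=lambda j: edges[j])
        r_t = edges[j_t]
        s_t = sum(lam)
        # g_t = max{0, G(lambda_t)}; G is taken as 0 at the all-zero start.
        g_t = max(0.0, -math.log(F_val) / s_t) if s_t > 0 else 0.0
        # Step size: atanh(r_t) - atanh(g_t); reduces to AdaBoost's when g_t = 0.
        lam[j_t] += math.atanh(r_t) - math.atanh(g_t)
        history.append(g_t)
    return lam, history


M = [[-1, 1, 1],
     [1, -1, 1],
     [1, 1, -1]]
lam, history = approx_coordinate_ascent(M, 200)
```

In this run the recorded values of `history` never decrease and stay below the maximum margin 1/3, in line with the claim that the update still increases $G$.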



5 Convergence of Algorithms 



We will show convergence of Algorithms 1 and 2 to a maximum margin solution. 
Although there are many papers describing the convergence of specific classes 
of coordinate descent/ascent algorithms (e.g., [15]), this problem did not fit into 
any of the existing categories. The proofs below account for both the optimal 
and non-optimal cases, and for both algorithms. 

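One of the main results of this analysis is that both algorithms make
significant progress at each iteration. In the next lemma, we only consider one
increment, so we fix $\lambda_t$ at iteration $t$ and let $g_t := G(\lambda_t)$, $s_t := s(\lambda_t)$. Then, denote
$g_{t+1}^{[1]} := G(\lambda_t + \alpha_t^{[1]} e_{j_t})$, $g_{t+1}^{[2]} := G(\lambda_t + \alpha_t^{[2]} e_{j_t})$, $s_{t+1}^{[1]} := s_t + \alpha_t^{[1]}$, and $s_{t+1}^{[2]} := s_t + \alpha_t^{[2]}$.

Lemma 3.

$$g_{t+1}^{[1]} - g_t \ge \frac{\alpha_t^{[1]} (r_t - g_t)}{2 s_{t+1}^{[1]}} \quad \text{and} \quad g_{t+1}^{[2]} - g_t \ge \frac{\alpha_t^{[2]} (r_t - g_t)}{2 s_{t+1}^{[2]}}.$$

Proof. We start with Algorithm 2. First, we note that since tanh is concave on
$\mathbb{R}_+$, we can lower bound tanh on an interval $(a, b) \subset (0, \infty)$ by the line connecting
the points $(a, \tanh(a))$ and $(b, \tanh(b))$. Thus,

$$\int_{\gamma_t - \alpha_t^{[2]}}^{\gamma_t} \tanh u\, du \ge \frac{1}{2} \alpha_t^{[2]} \left[ \tanh \gamma_t + \tanh(\gamma_t - \alpha_t^{[2]}) \right] = \frac{1}{2} \alpha_t^{[2]} (r_t + g_t), \qquad (10)$$

where the last equality is from (8). Combining (10) with (4) yields:
$s_{t+1}^{[2]} g_{t+1}^{[2]} \ge s_t g_t + \frac{1}{2} \alpha_t^{[2]} (r_t + g_t)$, thus $s_{t+1}^{[2]} (g_{t+1}^{[2]} - g_t) + \alpha_t^{[2]} g_t \ge \frac{1}{2} \alpha_t^{[2]} (r_t + g_t)$,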

Boosting Based on a Smooth Margin 



511 



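and the statement of the lemma follows (for Algorithm 2). By definition, $g_{t+1}^{[1]}$
is the maximum value of $G(\lambda_t + \alpha e_{j_t})$, so $g_{t+1}^{[1]} \ge g_{t+1}^{[2]}$. Because $\alpha/(s + \alpha) =
1 - s/(\alpha + s)$ increases with $\alpha$ and since $\alpha_t^{[1]} \le \alpha_t^{[2]}$,

$$g_{t+1}^{[1]} - g_t \ge g_{t+1}^{[2]} - g_t \ge \frac{\alpha_t^{[2]} (r_t - g_t)}{2 s_{t+1}^{[2]}} \ge \frac{\alpha_t^{[1]} (r_t - g_t)}{2 s_{t+1}^{[1]}}.$$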
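The progress bound for Algorithm 2 can be checked on a single step. The sketch below is illustrative: the values of $s_t$, $g_t$, and $r_t$ are made up, the function name is an assumption, and the new smooth margin is computed from the recursion for $G$, using the fact that the integral of tanh is ln cosh.

```python
import math


def algorithm2_progress(s_t, g_t, r_t):
    """Return (actual gain in G, lower bound alpha * (r - g) / (2 * s_next))."""
    gamma = math.atanh(r_t)
    alpha = gamma - math.atanh(g_t)              # Algorithm 2 step size
    s_next = s_t + alpha
    # Recursion for G: s_next * g_next = s_t * g_t + ln cosh(gamma) - ln cosh(gamma - alpha)
    g_next = (s_t * g_t
              + math.log(math.cosh(gamma))
              - math.log(math.cosh(gamma - alpha))) / s_next
    gain = g_next - g_t
    bound = alpha * (r_t - g_t) / (2.0 * s_next)
    return gain, bound


# Hypothetical state: s_t = 5, g_t = 0.1, edge r_t = 0.5.
gain, bound = algorithm2_progress(5.0, 0.1, 0.5)
```

For this state the actual gain is slightly larger than the bound, as the concavity argument for tanh predicts.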



Another important ingredient for our convergence proofs is that the step size 
does not increase too quickly; this is the main content of the next lemma. We 
now remove superscripts since each step holds for both algorithms. 



Lemma 4. $\lim_{t \to \infty} \alpha_t / s_{t+1} = 0$ for both Algorithms 1 and 2.

If $\lim_{t \to \infty} s_t$ is finite, the statement can be proved directly. If $\lim_{t \to \infty} s_t = \infty$,
our proof (which is omitted) uses (4), (5) and (8).

At this point, it is possible to use Lemma 3 and Lemma 4 to show asymptotic
convergence of both Algorithms 1 and 2 to a maximum margin solution; we defer
this calculation to the longer version. In what follows, we shall prove two different
results about the convergence rate. The first theorem gives an explicit a priori
upper bound on the number of iterations needed to guarantee that $g_t^{[1]}$ or $g_t^{[2]}$ is
within $\varepsilon > 0$ of the maximum margin $\rho$. As is often the case for uniformly valid
upper bounds, the convergence rate provided by this theorem is not optimal, in
the sense that faster decay of $\rho - g_t$ can be proved for large $t$ if one does not
insist on explicit constants. The second convergence rate theorem provides such
a result, stating that $\rho - g_t = O(t^{-1/(3+\delta)})$, or equivalently $\rho - g_t \le \varepsilon$ after
$O(\varepsilon^{-(3+\delta)})$ iterations, where $\delta > 0$ can be arbitrarily small.

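Both convergence rate theorems rely on estimates limiting the growth rate
of $\alpha_t$. Lemma 4 is one such estimate; because it is only an asymptotic estimate,
our first convergence rate theorem requires the following uniformly valid lemma.

Lemma 5.

$$\alpha_t^{[1]} \le c_1 + c_2 s_t^{[1]} \quad \text{and} \quad \alpha_t^{[2]} \le c_1 + c_2 s_t^{[2]}, \quad \text{where } c_1 = \frac{\ln 2}{1 - \rho} \text{ and } c_2 = \frac{\rho}{1 - \rho}. \qquad (11)$$

Proof. Consider Algorithm 2. From (4),

$$s_{t+1}^{[2]} g_{t+1}^{[2]} - s_t^{[2]} g_t^{[2]} = \ln \cosh \gamma_t - \ln \cosh(\gamma_t - \alpha_t^{[2]}).$$

Because $\frac{1}{2} e^{\xi} \le \frac{1}{2}(e^{\xi} + e^{-\xi}) = \cosh \xi \le e^{\xi}$ for $\xi > 0$, we have
$\xi - \ln 2 \le \ln \cosh \xi \le \xi$. Now,

$$s_{t+1}^{[2]} g_{t+1}^{[2]} - s_t^{[2]} g_t^{[2]} \ge \gamma_t - \ln 2 - (\gamma_t - \alpha_t^{[2]}), \text{ so}$$

$$\alpha_t^{[2]} (1 - \rho) \le \alpha_t^{[2]} (1 - g_{t+1}^{[2]}) \le \ln 2 + s_t^{[2]} (g_{t+1}^{[2]} - g_t^{[2]}) \le \ln 2 + \rho\, s_t^{[2]}.$$

Thus we directly find the statement of the lemma for Algorithm 2. A slight
extension of this argument proves the statement for Algorithm 1. □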




512 C. Rudin, R.E. Schapire, and I. Daubechies 



Theorem 1. (first convergence rate theorem) Suppose $R < 1$ is known to be an
upper bound for $\rho$. Let $\tilde{1}$ be the iteration at which $G$ becomes positive. Then both
the margin $\mu(\lambda_t)$ and the value of $G(\lambda_t)$ will be within $\varepsilon$ of the maximum margin
$\rho$ within at most

$$\tilde{1} + 1 + (s_{\tilde{1}} + \ln 2)\, \varepsilon^{-(3-R)/(1-R)}$$

iterations, for both Algorithms 1 and 2.



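Proof. Define $\Delta G(\lambda) := \rho - G(\lambda)$. Since (2) tells us that $0 \le \rho - \mu(\lambda_t) \le
\rho - G(\lambda_t) = \Delta G(\lambda_t)$, we need only to control how fast $\Delta G(\lambda_t) \to 0$ as $t \to \infty$.
That is, if $G(\lambda_t)$ is within $\varepsilon$ of the maximum margin $\rho$, so is the margin $\mu(\lambda_t)$.
Starting from Lemma 3,

$$\rho - g_{t+1} \le \rho - g_t - \frac{\alpha_t}{2 s_{t+1}} (r_t - \rho + \rho - g_t), \text{ thus}$$

$$\Delta G(\lambda_{t+1}) \le \Delta G(\lambda_t) \left[ 1 - \frac{\alpha_t}{2 s_{t+1}} \right] - \frac{\alpha_t (r_t - \rho)}{2 s_{t+1}} \le \Delta G(\lambda_{\tilde{1}}) \prod_{\ell=\tilde{1}}^{t} \left[ 1 - \frac{\alpha_\ell}{2 s_{\ell+1}} \right]. \qquad (12)$$

We stop the recursion at $\lambda_{\tilde{1}}$, where $\lambda_{\tilde{1}}$ is the coefficient vector at the first iteration
where $G$ is positive. We upper bound the product in (12) using Lemma 5.

$$\prod_{\ell=\tilde{1}}^{t} \left[ 1 - \frac{\alpha_\ell}{2 s_{\ell+1}} \right] = \prod_{\ell=\tilde{1}}^{t} \left[ 1 - \frac{1}{2} \frac{s_{\ell+1} - s_\ell}{s_{\ell+1}} \right] \le \exp\left[ -\frac{1}{2} \sum_{\ell=\tilde{1}}^{t} \frac{s_{\ell+1} - s_\ell}{s_{\ell+1}} \right]$$

$$\le \exp\left[ -\frac{1}{2} \sum_{\ell=\tilde{1}}^{t} \frac{s_{\ell+1} - s_\ell}{s_\ell + \frac{\ln 2}{1 - \rho} + \frac{\rho}{1 - \rho} s_\ell} \right] \le \exp\left[ -\frac{1 - \rho}{2} \sum_{\ell=\tilde{1}}^{t} \frac{s_{\ell+1} - s_\ell}{s_\ell + \ln 2} \right]$$

$$\le \exp\left[ -\frac{1 - \rho}{2} \int_{s_{\tilde{1}}}^{s_{t+1}} \frac{dv}{v + \ln 2} \right] = \left[ \frac{s_{\tilde{1}} + \ln 2}{s_{t+1} + \ln 2} \right]^{(1-\rho)/2}. \qquad (13)$$

It follows from (12) and (13) that

$$s_t \le s_t + \ln 2 \le (s_{\tilde{1}} + \ln 2) \left[ \frac{\Delta G(\lambda_{\tilde{1}})}{\Delta G(\lambda_t)} \right]^{2/(1-\rho)}. \qquad (14)$$

On the other hand, using some trickery one can show that for all $t$, for both
algorithms, $\alpha_t \ge \Delta G(\lambda_{t+1}) / (1 - \rho g_{\tilde{1}})$, which implies:

$$s_t \ge s_{\tilde{1}} + (t - \tilde{1}) \frac{\Delta G(\lambda_t)}{1 - \rho g_{\tilde{1}}}. \qquad (15)$$

Combining (14) with (15) leads to:

$$t - \tilde{1} \le \frac{(1 - \rho g_{\tilde{1}})\, s_t}{\Delta G(\lambda_t)} \le \frac{(1 - \rho g_{\tilde{1}}) (s_{\tilde{1}} + \ln 2) [\Delta G(\lambda_{\tilde{1}})]^{2/(1-\rho)}}{[\Delta G(\lambda_t)]^{1 + 2/(1-\rho)}}, \qquad (16)$$

which means $\Delta G(\lambda_t) \ge \varepsilon$ is possible only if $t \le \tilde{1} + (s_{\tilde{1}} + \ln 2)\, \varepsilon^{-(3-\rho)/(1-\rho)}$.
Therefore, $\Delta G(\lambda_t) < \varepsilon$ whenever $t$ exceeds

$$\tilde{1} + 1 + (s_{\tilde{1}} + \ln 2)\, \varepsilon^{-(3-R)/(1-R)} \ge \tilde{1} + 1 + (s_{\tilde{1}} + \ln 2)\, \varepsilon^{-(3-\rho)/(1-\rho)}. \qquad \Box$$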
In order to apply the proof of Theorem 1, one has to have an upper bound for
$\rho$, which we have denoted by $R$. This we may obtain in practice via the minimum
achieved edge $R = \min_{\ell \le t} r_\ell < 1$.

An important remark is that the technique of proof of Theorem 1 is much 
more widely applicable. In fact, this proof used only two main ingredients: 
Lemma 3 and Lemma 5. Inspection of the proof shows that the exact values 
of the constants occurring in these estimates are immaterial. Hence, Theorem 1 
may be used to obtain convergence rates for other algorithms. 

The convergence rate provided by Theorem 1 is not tight; our algorithms 
perform at a much faster rate in practice. The fact that the step-size bound in 
Lemma 5 holds for all t allowed us to find an upper bound on the number of 
iterations; however, we can find faster convergence rates in the asymptotic regime 
by using Lemma 4 instead. The following lemma holds for both Algorithms 1 
and 2. The proof, which is omitted, follows from Lemma 3 and Lemma 4. 

Lemma 6. For any $0 < \nu < 1/2$, there exists a constant $C_\nu$ such that for all
$t \ge \tilde 1$ (i.e., all iterations where G is positive), $\rho - g_t \le C_\nu s_t^{-\nu}$.

Theorem 2. (second convergence rate theorem) For both Algorithms 1 and 2,
and for any $\delta > 0$, a margin within $\varepsilon$ of optimal is obtained after at most
$O(\varepsilon^{-(3+\delta)})$ iterations from the iteration $\tilde 1$ where G becomes positive.

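Proof. By (15), we have $t - \tilde 1 \le (1 - \rho g_{\tilde 1})(\rho - g_t)^{-1}(s_t - s_{\tilde 1})$. Combining this
with Lemma 6 leads to $t - \tilde 1 \le (1 - \rho g_{\tilde 1})\, C_\nu^{1/\nu} (\rho - g_t)^{-(1 + 1/\nu)}$. For $\delta > 0$,
we pick $\nu = \nu_\delta := 1/(2 + \delta) < 1/2$, and we can rewrite the last inequality
as: $(\rho - g_t)^{3+\delta} \le (1 - \rho g_{\tilde 1})\, C_{\nu_\delta}^{2+\delta} (t - \tilde 1)^{-1}$, or $\rho - g_t \le C'_\delta (t - \tilde 1)^{-1/(3+\delta)}$, with
$C'_\delta = (1 - \rho g_{\tilde 1})^{1/(3+\delta)}\, C_{\nu_\delta}^{(2+\delta)/(3+\delta)}$. It follows that $\rho - \mu(\lambda_t) \le \rho - g_t < \varepsilon$ whenever
$t - \tilde 1 > (C'_\delta / \varepsilon)^{3+\delta}$, which completes the proof of Theorem 2. □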

Although Theorem 2 gives a better convergence rate than Theorem 1 since
$3 < 1 + 2/(1 - \rho)$, there is an unknown constant $C'_\delta$, so that this estimate cannot
be translated into an a priori upper bound on the number of iterations after
which $\rho - g_t < \varepsilon$ is guaranteed, unlike Theorem 1.

6 Simulation Experiments 

The updates of Algorithm 2 are less aggressive than AdaBoost’s, but slightly 
more aggressive than the updates of arc-gv, and AdaBoost*. Algorithm 1 seems 
to perform very similarly to Algorithm 2 in practice, so we use Algorithm 2. This 
section is designed to illustrate our analysis as well as the differences between 
the various coordinate boosting algorithms; in order to do this, we give each 
algorithm the same random input, and examine convergence of all algorithms 
with respect to the margin. Experiments on real data are in our future plans. 

Artificial test data for Figure 2 was designed as follows: 50 examples were
constructed randomly such that each lies on a corner of the hypercube $\{-1,1\}^{100}$.
We set $y_i = \mathrm{sign}(\sum_k x_i(k))$, where $x_i(k)$ indicates the $k$-th component of $x_i$.
The $j$-th weak learner is $h_j(x) = x(j)$, thus $M_{ij} = y_i x_i(j)$. To implement the
“non-optimal” case, we chose a random classifier from the set of sufficiently
good classifiers at each iteration.

We use the definitions of arc-gv and AdaBoost* found in Meir and Rätsch’s
survey [8]. AdaBoost, arc-gv, Algorithm 1 and Algorithm 2 have initially large
updates, based on a conservative estimate of the margin. AdaBoost*’s updates
are initially small, based on an overestimate of the margin.

AdaBoost’s updates remain consistently large, causing $\lambda_t$ to grow quickly and
causing fast convergence with respect to G. AdaBoost seems to converge to the
maximum margin in (a); however, it does not seem to in (b), (d) or (e). Algorithm
2 converges fairly quickly and dependably; arc-gv and AdaBoost* are slower
here. We could provide a larger value of $\nu$ in AdaBoost* to encourage faster
convergence, but we would sacrifice a guarantee on accuracy. The more “optimal”
we choose the weak learners, the better the larger step-size algorithms (AdaBoost
and Algorithm 2) perform, relative to AdaBoost*; this is because AdaBoost*’s
update uses the minimum achieved edge, which translates into smaller steps
while the weak learning algorithm is doing well.




Fig. 2. AdaBoost, AdaBoost* (parameter $\nu$ set to .001), arc-gv, and Algorithm 2 on
synthetic data. (a-Top Left) Optimal case. (b-Top Right) Non-optimal case, using the
same 50 x 100 matrix M as in (a). (c-Bottom Left) Optimal case, using a different
matrix. (d-Bottom Right) Non-optimal case, using the same matrix as (c).
7 A New Way to Measure AdaBoost’s Progress

AdaBoost is still a mysterious algorithm. Even in the optimal case it may con- 
verge to a solution with margin significantly below the maximum [11]. Thus, 
the margin theory only provides a significant piece of the puzzle of AdaBoost’s 
strong generalization properties; it is not the whole story [5,2,11]. Hence, we give 
some connections between our new algorithms and AdaBoost, to help us under- 
stand how AdaBoost makes progress. In this section, we measure the progress of 
AdaBoost according to something other than the margin, namely, our smooth 
margin function G. First, we show that whenever AdaBoost takes a large step, 
it makes progress according to G. We use the superscript [A] for AdaBoost.

Theorem 3. $G(\lambda^{[A]}_{t+1}) \ge G(\lambda^{[A]}_t) \iff \Upsilon(r_t) \ge G(\lambda^{[A]}_t)$, where $\Upsilon : (0,1) \to (0,\infty)$ is a monotonically increasing function.

In other words, $G(\lambda^{[A]}_{t+1}) \ge G(\lambda^{[A]}_t)$ if and only if the edge $r_t$ is sufficiently large.

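Proof. Using AdaBoost’s update $\alpha^{[A]}_t = \gamma_t$, $G(\lambda^{[A]}_t) \le G(\lambda^{[A]}_{t+1})$ if and only if:

$$(s^{[A]}_t + \alpha^{[A]}_t)\, G(\lambda^{[A]}_t) \le (s^{[A]}_t + \alpha^{[A]}_t)\, G(\lambda^{[A]}_{t+1}) = s^{[A]}_t\, G(\lambda^{[A]}_t) + \int_0^{\alpha^{[A]}_t} \tanh u \, du,$$

$$\text{i.e.,}\quad G(\lambda^{[A]}_t) \le \frac{1}{\alpha^{[A]}_t}\int_0^{\alpha^{[A]}_t} \tanh u \, du,$$

where we have used (4). We denote the expression on the right hand side by
$\Upsilon(r_t)$, which can be rewritten as: $\Upsilon(r_t) := -\ln(1 - r_t^2) \,/\, \ln\frac{1 + r_t}{1 - r_t}$. Since $\Upsilon(r)$
is monotonically increasing in r, our statement is proved. □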
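
The closed form in the proof can be checked numerically. The sketch below is only an illustration: it assumes the AdaBoost step size satisfies tanh(alpha_t) = r_t, i.e. alpha_t = atanh(r_t), which this section does not restate, and the function names are hypothetical rather than taken from the paper. It compares the closed form with a midpoint-rule approximation of the integral expression.

```python
import math


def upsilon_closed_form(r):
    """Closed form from the proof: -ln(1 - r^2) / ln((1 + r) / (1 - r))."""
    return -math.log(1.0 - r * r) / math.log((1.0 + r) / (1.0 - r))


def upsilon_integral(r, steps=20000):
    """(1 / alpha) * integral_0^alpha tanh(u) du, with alpha = atanh(r) (assumed).

    The integral is approximated with the midpoint rule.
    """
    alpha = math.atanh(r)
    h = alpha / steps
    total = sum(math.tanh((k + 0.5) * h) for k in range(steps)) * h
    return total / alpha


if __name__ == "__main__":
    for r in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(r, upsilon_closed_form(r), upsilon_integral(r))
```

Under these assumptions the two columns agree to numerical precision, and the values grow with r, in line with the monotonicity claimed in Theorem 3.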

Hence, AdaBoost makes progress (measured by G) if and only if it takes a big 
enough step. Figure 3, which shows the evolution of the edge values, illustrates 
this. Whenever G increased from the current iteration to the following iteration, 
a small dot was plotted. Whenever G decreased, a large dot was plotted. The fact 
that the larger dots are below the smaller dots is a direct result of Theorem 3. 
In fact, one can visually track the progress of G using the boundary between the 
larger and smaller dots. 

AdaBoost’s weight vectors often converge to a periodic cycle when there are 
few support vectors [11]. Where Algorithms 1 and 2 make progress with respect 
to G at every iteration, the opposite is true for cyclic AdaBoost, namely that 
AdaBoost cannot increase G at every iteration, by the following: 

Theorem 4. If AdaBoost’s weight vectors converge to a cycle of length T
iterations, the cycle must obey one of the following conditions:

1. the value of G decreases for at least one iteration within the cycle, or

2. the value of G is constant at every iteration, and the edge values in the cycle
are equal.




516 C. Rudin, R.E. Schapire, and I. Daubechies 







Fig. 3. Value of the edge at each iteration t, for a run of AdaBoost using the 12 x 25 
matrix M shown (black is -1, white is +1). AdaBoost alternates between chaotic and 
cyclic behavior. For further explanation of the interesting dynamics in this plot, see [11]. 



In other words, the value of G cannot be strictly increasing within a cycle. 
The main ingredients for the proof (which is omitted) are Theorem 3 and (4). 
For specific cases that have been studied [11], the value of G is non-decreasing, 
and the value of rt is the same at every iteration of the cycle. In such cases, a 
stronger equivalence between support vectors exists here; they are all “viewed” 
similarly by the weak learning algorithm, in that they are misclassified the same 
proportion of the time. (This is surprising since weak classifiers may appear more 
than once per cycle.) 

Theorem 5. Assume AdaBoost cycles. If all edges are the same, then all sup- 
port vectors are misclassified by the same number of weak classifiers per cycle. 

Proof. Let $r_t =: r$ which is constant. Consider support vectors $i$ and $i'$. All
support vectors obey the cycle condition [11], namely: $\prod_{t=1}^{T}(1 + M_{i j_t} r) = \prod_{t=1}^{T}(1 + M_{i' j_t} r) = 1$. Define $\tau_i := |\{t : M_{i j_t} = 1\}|$, the number of times example
$i$ is correctly classified during one cycle of length T. Now, $1 = \prod_{t=1}^{T}(1 + M_{i j_t} r) = (1 + r)^{\tau_i}(1 - r)^{T - \tau_i} = (1 + r)^{\tau_{i'}}(1 - r)^{T - \tau_{i'}}$. Hence, $\tau_i = \tau_{i'}$. Thus, example $i$ is
misclassified the same number of times that $i'$ is misclassified. Since the choice
of $i$ and $i'$ was arbitrary, this holds for all support vectors. □
Abstract. In connection with two-label classification tasks over the
Boolean domain, we consider the possibility to combine the key advan- 
tages of Bayesian networks and of kernel-based learning systems. This 
leads us to the basic question whether the class of decision functions 
induced by a given Bayesian network can be represented within a low- 
dimensional inner product space. For Bayesian networks with an explic- 
itly given (full or reduced) parameter collection, we show that the “natu- 
ral” inner product space has the smallest possible dimension up to factor 
2 (even up to an additive term 1 in many cases). For a slight modification 
of the so-called logistic autoregressive Bayesian network with n nodes, we 
show that every sufficiently expressive inner product space has dimension
at least $2^{n/4}$. The main technical contribution of our work consists
in uncovering combinatorial and algebraic structures within Bayesian 
networks such that known techniques for proving lower bounds on the 
dimension of inner product spaces can be brought into play. 



1 Introduction 

During the last decade, there has been a lot of interest in learning systems 
whose hypotheses can be written as inner products in an appropriate feature 
space, trained with a learning algorithm that performs a kind of empirical or 
structural risk minimization. The inner product operation is often not carried 
out explicitly, but reduced to the evaluation of a so-called kernel-function that 
operates on instances of the original data space, which offers the opportunity 
to handle high-dimensional feature spaces in an efficient manner. This learning 
strategy introduced by Vapnik and co-workers [4,33] in connection with the so- 
called Support Vector Machine is a theoretically well founded and very powerful 
method that, in the years since its introduction, has already outperformed most 
other systems in a wide variety of applications. 

Bayesian networks have a long history in statistics, and in the first half of 
the 1980s they were introduced to the field of expert systems through work by 
Pearl [25] and Spiegelhalter and Knill-Jones [29]. They are much different from

* This work has been supported in part by the Deutsche Forschungsgemeinschaft 
Grant SI 498/7-1. 
kernel-based learning systems and offer some complementary advantages. They
graphically model conditional independence relationships between random
variables. There exist quite elaborate methods for choosing an appropriate network,
for performing probabilistic inference (inferring missing data from existing ones), 
and for solving pattern classification tasks or unsupervised learning problems. 
Like other probabilistic models, Bayesian networks can be used to represent in- 
homogeneous training samples with possibly overlapping features and missing 
data in a uniform manner. 

Quite recently, several research groups considered the possibility of combining
the key advantages of probabilistic models and kernel-based learning systems.
For this purpose, several kernels (like the Fisher kernel, for instance) were
studied extensively [17,18,24,27,31,32,30]. Altun, Tsochantaridis, and Hofmann [1]
proposed (and experimented with) a kernel related to the Hidden Markov Model. 

In this paper, we focus on two-label classification tasks over the Boolean do- 
main and on probabilistic models that can be represented as Bayesian networks. 
Intuitively, we aim at finding the “simplest” inner product space that is able 
to express the class of decision functions (briefly called “concept class” in what 
follows) induced by a given Bayesian network. We restrict ourselves to Euclidean 
spaces equipped with the standard scalar product.^1 Furthermore, we use the
Euclidean dimension of the space as our measure of simplicity.^2 Our main results
are as follows: 

1) For Bayesian networks with an explicitly given (full or reduced) parameter 
collection, the “natural” inner product space (obtained from the probabilistic 
model by fairly straightforward algebraic manipulations) has the smallest possi- 
ble dimension up to factor 2 (even up to an additive term 1 in many cases) . The 
(almost) matching lower bounds on the smallest possible dimension are found 
by analyzing the VC-dimension of the concept class associated with a Bayesian 
network. 

2) We present a quadratic lower bound and the upper bound $O(n^6)$ on the
VC-dimension of the concept class associated with the so-called “logistic
autoregressive Bayesian network” (also known as “sigmoid belief network”)^3, where n
denotes the number of nodes.

3) For a slight modification of the logistic autoregressive Bayesian network with
n + 2 nodes, we show that every sufficiently expressive inner product space has
dimension at least $2^{n/4}$. The proof of this lower bound proceeds by showing that

^1 This is no loss of generality (except for the infinite-dimensional case) since any
finite-dimensional reproducing kernel Hilbert space is isometric with $\mathbb{R}^d$ for some d.

^2 This is well motivated by the fact that most generalization error bounds for linear
classifiers are given in terms of either the Euclidean dimension or in terms of the ge- 
ometrical margin between the data points and the separating hyperplanes. Applying 
random projection techniques from [19,14,2], it can be shown that any arrangement 
with a large margin can be converted into a low-dimensional arrangement. Thus, a 
large lower bound on the smallest possible dimension rules out the possibility of a 
large margin classifier. 

^3 Originally proposed by McCullagh and Nelder [7] and studied systematically, for
instance, by Neal [23] and by Saul, Jaakkola, and Jordan [26].
the concept class induced by such a network contains exponentially many
decision functions that are pairwise orthogonal on an exponentially large subdomain.
Since the VC-dimension of this concept class has the same order of magnitude,
$O(n^6)$, as the original (unmodified) network, VC-dimension considerations would
be insufficient to reveal the exponential lower bound.

While (as mentioned above) there exist already some papers that investigate 
the connection between probabilistic models and inner product spaces, it seems 
that this work is the first one which addresses explicitly the question of finding 
a smallest-dimensional sufficiently expressive inner product space. It should be 
mentioned however that there exist a couple of papers [10,11,3,13,12,20,21] (not 
concerned with probabilistic models) considering the related question of finding 
an embedding of a given concept class into a system of half-spaces. The main 
technical contribution of our work can be seen in uncovering combinatorial and 
algebraic structures within Bayesian networks such that techniques known from 
these papers can be brought into play. 

2 Preliminaries 

In this section, we present formal definitions for the basic notions in this pa- 
per. Subsection 2.1 is concerned with notions from learning theory. In Subsec- 
tion 2.2, we formally introduce Bayesian networks and the distributions and 
concept classes induced by them. The notion of a linear arrangement for a con- 
cept class is presented in Subsection 2.3. 

2.1 Concept Classes and VC-Dimension 

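A concept class $\mathcal{C}$ over domain $\mathcal{X}$ is a family of functions of the form $f : \mathcal{X} \to \{-1,+1\}$. Each $f \in \mathcal{C}$ is then called a concept. A set $S = \{s_1, \ldots, s_m\} \subseteq \mathcal{X}$ of
size m is said to be shattered by $\mathcal{C}$ if

$$\forall b \in \{-1,+1\}^m,\ \exists f \in \mathcal{C},\ \forall i = 1, \ldots, m : f(s_i) = b_i .$$

The VC-dimension of $\mathcal{C}$ is given by

$$\mathrm{VCdim}(\mathcal{C}) = \sup\{m \mid \exists S \subseteq \mathcal{X} : |S| = m \text{ and } S \text{ is shattered by } \mathcal{C}\} .$$

For every $z \in \mathbb{R}$, let sign(z) = +1 if $z \ge 0$ and sign(z) = -1 otherwise. In
the context of concept classes, the sign-function is sometimes used for mapping
real-valued functions $f$ to ±1-valued functions $\mathrm{sign} \circ f$.

We write $\mathcal{C} \le \mathcal{C}'$ for concept classes $\mathcal{C}$ over domain $\mathcal{X}$ and $\mathcal{C}'$ over domain $\mathcal{X}'$
if there exist mappings

$$\mathcal{C} \ni f \mapsto f' \in \mathcal{C}', \qquad \mathcal{X} \ni x \mapsto x' \in \mathcal{X}'$$

such that $f(x) = f'(x')$ for every $f \in \mathcal{C}$ and every $x \in \mathcal{X}$. Note that $\mathcal{C} \le \mathcal{C}'$
implies that $\mathrm{VCdim}(\mathcal{C}) \le \mathrm{VCdim}(\mathcal{C}')$ because the following holds: if $S \subseteq \mathcal{X}$ is a
set of size m that is shattered by $\mathcal{C}$ then $S' = \{s' \mid s \in S\} \subseteq \mathcal{X}'$ is a set of size m
that is shattered by $\mathcal{C}'$.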
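
The definitions of shattering and VC-dimension can be made concrete with a brute-force check on a small finite class. The following Python sketch is an illustration only; the helper names are hypothetical and the parity class on three bits is used purely as a toy example.

```python
from itertools import combinations, product


def is_shattered(points, concepts):
    """True if every +-1 labelling of `points` is realised by some concept."""
    realised = {tuple(f(s) for s in points) for f in concepts}
    return len(realised) == 2 ** len(points)


def vc_dimension(domain, concepts):
    """Largest m such that some m-subset of `domain` is shattered (brute force)."""
    best = 0
    for m in range(1, len(domain) + 1):
        if any(is_shattered(S, concepts) for S in combinations(domain, m)):
            best = m
        else:
            break  # subsets of shattered sets are shattered, so larger m cannot work
    return best


def parity_concepts(n):
    """Toy class: h_a(x) = (-1)^(a . x) for every a in {0,1}^n."""
    def make(a):
        return lambda x: (-1) ** sum(ai * xi for ai, xi in zip(a, x))
    return [make(a) for a in product((0, 1), repeat=n)]


if __name__ == "__main__":
    domain = list(product((0, 1), repeat=3))
    print(vc_dimension(domain, parity_concepts(3)))
```

The search is exponential in the domain size, so it is only meant for very small examples.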
2.2 Bayesian Networks

Definition 1. A Bayesian network $\mathcal{N}$ consists of the following components:

1. a directed acyclic graph $\mathcal{G} = (V,E)$,

2. a collection $(p_{i,\alpha})_{i \in V, \alpha \in \{0,1\}^{m_i}}$ of programmable parameters with values in
the open interval ]0, 1[, where $m_i$ denotes the number of $j \in V$ such that
$(j,i) \in E$,

3. constraints that describe which assignments of values from ]0, 1[ to the
parameters of the collection are allowed.

If the constraints are empty, we speak of an unconstrained network. Otherwise,
we say the network is constrained.

Conventions: We will identify the $n = |V|$ nodes of $\mathcal{N}$ with the numbers from 1
to n and assume that every edge $(j,i) \in E$ satisfies $j < i$ (topological ordering).
If $(j,i) \in E$, then j is called a parent of i. $P_i$ denotes the set of parents of node
i and $m_i = |P_i|$ denotes the number of parents. $\mathcal{N}$ is said to be fully connected if
$P_i = \{1, \ldots, i-1\}$ for every node i. We will associate with every node i a Boolean
variable $x_i$ with values in {0,1}. We say $x_j$ is a parent-variable of $x_i$ if j is
a parent of i. Each $\alpha \in \{0,1\}^{m_i}$ is called a possible bit-pattern for the
parent-variables of $x_i$. $M_{i,\alpha}(x)$ denotes the polynomial that indicates whether the parent
variables of $x_i$ exhibit bit-pattern $\alpha$. More formally, $M_{i,\alpha}(x) = \prod_{j \in P_i} x_j^{\alpha_j}$, where
$x_j^0 = 1 - x_j$ and $x_j^1 = x_j$.

An unconstrained network with a dense graph has an exponentially growing 
number of parameters. In a constrained network, the number of parameters can 
be kept reasonably small even in case of a dense topology. The following two 
definitions exemplify this approach. Definition 2 contains (as a special case) 
the networks that were proposed in [5]. (See Example 2 below.) Definition 3 
deals with so-called logistic autoregressive Bayesian networks that, given their 
simplicity, perform surprisingly well on some problems. (See the discussion of 
these networks in [15].) 

Definition 2. A Bayesian network with a reduced parameter collection is a
Bayesian network whose constraints can be described as follows. For every $i \in \{1, \ldots, n\}$, there exists a surjective function $R_i : \{0,1\}^{m_i} \to \{1, \ldots, d_i\}$ such
that the parameters of $\mathcal{N}$ satisfy

$$\forall i = 1, \ldots, n,\ \forall \alpha, \alpha' \in \{0,1\}^{m_i} : R_i(\alpha) = R_i(\alpha') \implies p_{i,\alpha} = p_{i,\alpha'} .$$

We denote the network as $\mathcal{N}^R$ for $R = (R_1, \ldots, R_n)$. Obviously, $\mathcal{N}^R$ is
completely described by the reduced parameter collection $(p_{i,c})_{1 \le i \le n, 1 \le c \le d_i}$.



Definition 3. The logistic autoregressive Bayesian network $\mathcal{N}_\sigma$ is the fully
connected Bayesian network with the following constraints on the parameter
collection:

$$\forall i = 1, \ldots, n,\ \exists (w_{i,j})_{1 \le j \le i-1} \in \mathbb{R}^{i-1},\ \forall \alpha \in \{0,1\}^{i-1} : p_{i,\alpha} = \sigma\left(\sum_{j=1}^{i-1} w_{i,j}\,\alpha_j\right),$$

where $\sigma(y) = 1/(1 + e^{-y})$ denotes the standard sigmoid function. Obviously, $\mathcal{N}_\sigma$
is completely described by the parameter collection $(w_{i,j})_{1 \le i \le n, 1 \le j \le i-1}$.

In the introduction, we mentioned that Bayesian networks graphically model 
conditional independence relationships. This general idea is captured in the fol- 
lowing 

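Definition 4. Let $\mathcal{N}$ be a Bayesian network with nodes 1, ..., n. The class
of distributions induced by $\mathcal{N}$, denoted as $\mathcal{D}_\mathcal{N}$, consists of all distributions on
$\{0,1\}^n$ of the form

$$P(x) = \prod_{i=1}^{n} \prod_{\alpha \in \{0,1\}^{m_i}} p_{i,\alpha}^{x_i M_{i,\alpha}(x)} (1 - p_{i,\alpha})^{(1 - x_i) M_{i,\alpha}(x)} . \qquad (1)$$

For every assignment of values from ]0, 1[ to the parameters of $\mathcal{N}$, we obtain a
concrete distribution from $\mathcal{D}_\mathcal{N}$. Recall that not each assignment is allowed if $\mathcal{N}$
is constrained.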

The polynomial representation of $\log P(x)$ resulting from (1) is called Chow
expansion in the pattern classification literature [9]. Parameter $p_{i,\alpha}$ represents
the conditional probability for $x_i = 1$ given that the parent variables of $x_i$
exhibit bit-pattern $\alpha$. Formula (1) expresses $P(x)$ as a product of conditional
probabilities (chain-expansion).
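
Formula (1) can be evaluated directly for a small network. The sketch below is a minimal illustration, assuming 0-indexed nodes and a dictionary representation of the parameters (both are choices made here, not conventions of the paper); it also checks that the resulting values sum to one over $\{0,1\}^n$.

```python
from itertools import product


def pattern_indicator(x, parents_i, alpha):
    """M_{i,alpha}(x): 1 if the parent variables of node i show bit-pattern alpha."""
    return int(all(x[j] == a for j, a in zip(parents_i, alpha)))


def probability(x, parents, p):
    """P(x) as in (1); p[i][alpha] is the probability of x_i = 1 given pattern alpha."""
    result = 1.0
    for i, pa in enumerate(parents):
        for alpha in product((0, 1), repeat=len(pa)):
            m = pattern_indicator(x, pa, alpha)
            result *= p[i][alpha] ** (x[i] * m) * (1.0 - p[i][alpha]) ** ((1 - x[i]) * m)
    return result


# Hypothetical fully connected network on three nodes (0-indexed).
PARENTS = [(), (0,), (0, 1)]
PARAMS = [
    {(): 0.3},
    {(0,): 0.6, (1,): 0.2},
    {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
]

if __name__ == "__main__":
    total = sum(probability(x, PARENTS, PARAMS) for x in product((0, 1), repeat=3))
    print(total)
```

Looping over all bit-patterns mirrors the double product in (1); in practice only the one pattern with $M_{i,\alpha}(x) = 1$ contributes a factor different from 1.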

Example 1 (k-order Markov chain). For $k \ge 0$, $\mathcal{N}_k$ denotes the unconstrained
Bayesian network with $P_i = \{i-1, \ldots, i-k\}$ for $i = 1, \ldots, n$ (with the convention
that numbers smaller than 1 are ignored such that $m_i = |P_i| = \min\{i-1, k\}$).
The total number of parameters equals $2^k(n-k) + 2^{k-1} + \cdots + 2 + 1 = 2^k(n-k+1) - 1$.

We briefly explain that, for a Bayesian network with a reduced parameter
set, distribution $P(x)$ from Definition 4 can be written in a simpler fashion.
Let $R_{i,c}(x)$ denote the 0,1-valued function that indicates for every $x \in \{0,1\}^n$
whether the projection of x to the parent-variables of $x_i$ is mapped to c by $R_i$.
Then, the following holds:

$$P(x) = \prod_{i=1}^{n} \prod_{c=1}^{d_i} p_{i,c}^{x_i R_{i,c}(x)} (1 - p_{i,c})^{(1 - x_i) R_{i,c}(x)} . \qquad (2)$$

Example 2. Chickering, Heckerman, and Meek [5] proposed Bayesian networks
“with local structure”. They used a decision tree $T_i$ (or, alternatively, a decision
graph $G_i$) over the parent-variables of $x_i$ for every $i \in \{1, \ldots, n\}$. The conditional
probability for $x_i = 1$ given the bit-pattern of the variables from $P_i$ is attached to
the corresponding leaf in $T_i$ (or sink in $G_i$, respectively). This fits nicely into our
framework of networks with a reduced parameter collection. Here, $d_i$ denotes the
number of leaves in $T_i$ (or sinks of $G_i$, respectively), and $R_i(\alpha) = c \in \{1, \ldots, d_i\}$
if $\alpha$ is routed to leaf c in $T_i$ (or to sink c in $G_i$, respectively).
In a two-label classification task, functions $P(x), Q(x) \in \mathcal{D}_\mathcal{N}$ are used as
discriminant functions, where $P(x)$ and $Q(x)$ represent the distributions of x
conditioned to label +1 and -1, respectively. The corresponding decision function
assigns label +1 to x if $P(x) \ge Q(x)$ and -1 otherwise. The obvious connection
to concept classes in learning theory is made explicit in the following

Definition 5. Let $\mathcal{N}$ be a Bayesian network with nodes 1, ..., n and $\mathcal{D}_\mathcal{N}$
the corresponding class of distributions. The class of concepts induced by $\mathcal{N}$,
denoted as $\mathcal{C}_\mathcal{N}$, consists of all ±1-valued functions on $\{0,1\}^n$ of the form
$\mathrm{sign}(\log(P(x)/Q(x)))$ for $P, Q \in \mathcal{D}_\mathcal{N}$. Note that this function attains value +1
if $P(x) \ge Q(x)$ and value -1 otherwise.

The VC-dimension of $\mathcal{C}_\mathcal{N}$ is simply denoted as $\mathrm{VCdim}(\mathcal{N})$ throughout the paper.

2.3 Linear Arrangements in Inner Product Spaces 

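As explained in the introduction, we restrict ourselves to finite-dimensional
Euclidean spaces and the standard scalar product $u^\top v = \sum_{i=1}^{d} u_i v_i$, where
$u^\top$ denotes the transpose of u.

Definition 6. A d-dimensional linear arrangement for a concept class $\mathcal{C}$ over
domain $\mathcal{X}$ is given by collections $(u_f)_{f \in \mathcal{C}}$ and $(v_x)_{x \in \mathcal{X}}$ of vectors in $\mathbb{R}^d$ such that

$$\forall f \in \mathcal{C},\ x \in \mathcal{X} : f(x) = \mathrm{sign}(u_f^\top v_x) .$$

The smallest d such that there exists a d-dimensional linear arrangement for $\mathcal{C}$
(possibly $\infty$ if there is no finite-dimensional arrangement) is denoted as $\mathrm{Edim}(\mathcal{C})$.^4

If $\mathcal{C}_\mathcal{N}$ is the concept class induced by a Bayesian network $\mathcal{N}$, we simply write
$\mathrm{Edim}(\mathcal{N})$ instead of $\mathrm{Edim}(\mathcal{C}_\mathcal{N})$. Note that $\mathrm{Edim}(\mathcal{C}) \le \mathrm{Edim}(\mathcal{C}')$ if $\mathcal{C} \le \mathcal{C}'$.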

It is easy to see that $\mathrm{Edim}(\mathcal{C}) \le \min\{|\mathcal{C}|, |\mathcal{X}|\}$ for finite classes. Less trivial
upper bounds are usually obtained constructively, by presenting an appropriate
arrangement. As for lower bounds, the following is known:

Lemma 1. $\mathrm{Edim}(\mathcal{C}) \ge \mathrm{VCdim}(\mathcal{C})$.

Lemma 2 ([10]). Let $f_1, \ldots, f_m \in \mathcal{C}$, $x_1, \ldots, x_n \in \mathcal{X}$, and $M \in \{-1,+1\}^{m \times n}$
be the matrix given by $M_{ij} = f_i(x_j)$. Then, $\mathrm{Edim}(\mathcal{C}) \ge \sqrt{mn}/\|M\|$, where
$\|M\| = \sup_{z \in \mathbb{R}^n : \|z\|_2 = 1} \|Mz\|_2$ denotes the spectral norm of M.

Lemma 1 easily follows from a result by Cover [6] which states that
$\mathrm{VCdim}(\{\mathrm{sign} \circ f \mid f \in \mathcal{F}\}) = d$ for every d-dimensional vector space $\mathcal{F}$ consisting
of real-valued functions. Lemma 2 (proven in [10]) is highly non-trivial.

Let $\mathrm{PARITY}_n$ denote the concept class $\{h_a \mid a \in \{0,1\}^n\}$ on the Boolean
domain given by $h_a(x) = (-1)^{a^\top x}$. Let $H_n \in \{-1,+1\}^{2^n \times 2^n}$ denote the matrix
with entry $h_a(x)$ in row a and column x (Hadamard-matrix). From Lemma 2 and
the well-known fact that $\|H_n\| = 2^{n/2}$ (which holds for any orthogonal matrix
from $\{-1,+1\}^{2^n \times 2^n}$), one gets

Corollary 1 ([10]). $\mathrm{Edim}(\mathrm{PARITY}_n) \ge 2^n/\|H_n\| = 2^{n/2}$.
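
The numbers in Corollary 1 can be reproduced for small n. The following Python sketch is illustrative only (function names are hypothetical): it builds $H_n$, verifies that its rows are pairwise orthogonal, which gives $H_n H_n^\top = 2^n I$ and hence spectral norm $2^{n/2}$, and then evaluates the bound of Lemma 2.

```python
from itertools import product


def parity_matrix(n):
    """Matrix H_n with entry h_a(x) = (-1)^(a . x) in row a and column x."""
    points = list(product((0, 1), repeat=n))
    return [[(-1) ** sum(ai * xi for ai, xi in zip(a, x)) for x in points]
            for a in points]


def rows_orthogonal(H):
    """Check H H^T = N * I, which forces the spectral norm of H to be sqrt(N)."""
    N = len(H)
    return all(
        sum(u * v for u, v in zip(H[r], H[s])) == (N if r == s else 0)
        for r in range(N)
        for s in range(N)
    )


def forster_bound(m, n_cols, spectral_norm):
    """Lower bound sqrt(m * n) / ||M|| from Lemma 2."""
    return (m * n_cols) ** 0.5 / spectral_norm


if __name__ == "__main__":
    n = 4
    H = parity_matrix(n)
    size = len(H)  # 2^n rows and columns
    print(rows_orthogonal(H), forster_bound(size, size, size ** 0.5))
```

The orthogonality check avoids a numerical eigenvalue computation: once $H H^\top = N I$ is confirmed exactly in integer arithmetic, the spectral norm is $\sqrt{N}$.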

^4 Edim stands for Euclidean dimension.
3 Linear Arrangements for Bayesian Networks

In this section, we present concrete linear arrangements for several types of
Bayesian networks, which leads to upper bounds on $\mathrm{Edim}(\mathcal{N})$. We sketch the
proofs only.

For a set M, $2^M$ denotes its power set.

Theorem 1. For every unconstrained Bayesian network $\mathcal{N}$, the following holds:

$$\mathrm{Edim}(\mathcal{N}) \le \left|\bigcup_{i=1}^{n} 2^{P_i \cup \{i\}}\right| \le 2 \cdot \sum_{i=1}^{n} 2^{m_i} .$$



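Proof. From the expansion of P in (1) and the corresponding expansion of Q
(with parameters $q_{i,\alpha}$ in the role of $p_{i,\alpha}$), we get

$$\log\frac{P(x)}{Q(x)} = \sum_{i=1}^{n} \sum_{\alpha \in \{0,1\}^{m_i}} \left[ x_i M_{i,\alpha}(x) \log\frac{p_{i,\alpha}}{q_{i,\alpha}} + (1 - x_i) M_{i,\alpha}(x) \log\frac{1 - p_{i,\alpha}}{1 - q_{i,\alpha}} \right] . \qquad (3)$$

On the right-hand side of (3), we find the polynomials $M_{i,\alpha}(x)$ and $x_i M_{i,\alpha}(x)$.
Note that $\left|\bigcup_{i=1}^{n} 2^{P_i \cup \{i\}}\right|$ equals the number of monomials that occur when we
express these polynomials as sums of monomials by successive applications of
the distributive law. A linear arrangement of the appropriate dimension is now
obtained in the obvious fashion by introducing one coordinate per monomial. □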



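Corollary 2. Let $\mathcal{N}_k$ denote the Bayesian network from Example 1. Then:

$$\mathrm{Edim}(\mathcal{N}_k) \le (n - k + 1)\,2^k .$$

Proof. Apply Theorem 1 and observe that

$$\bigcup_{i=1}^{n} 2^{P_i \cup \{i\}} = \bigcup_{i=k+1}^{n} \{J_i \cup \{i\} \mid J_i \subseteq \{i-1, \ldots, i-k\}\} \cup \{J \mid J \subseteq \{1, \ldots, k\}\} . \qquad \Box$$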
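
The counting step in this proof can be verified by enumeration for small n and k. The Python sketch below is illustrative only; it uses 1-indexed nodes as in Example 1, and the helper names are hypothetical. It collects the index sets of all monomials, that is, the union of the power sets $2^{P_i \cup \{i\}}$, and returns the family so that its size can be compared with $(n-k+1)2^k$.

```python
from itertools import combinations


def parents_markov(i, k):
    """Parent set of node i (1-indexed) in the k-order Markov chain of Example 1."""
    return [j for j in range(i - k, i) if j >= 1]


def monomial_index_sets(n, k):
    """Union over all nodes i of the power set of P_i united with {i}."""
    family = set()
    for i in range(1, n + 1):
        ground = parents_markov(i, k) + [i]
        for size in range(len(ground) + 1):
            for subset in combinations(ground, size):
                family.add(frozenset(subset))
    return family


if __name__ == "__main__":
    for n, k in ((5, 0), (5, 2), (6, 3)):
        print(n, k, len(monomial_index_sets(n, k)), (n - k + 1) * 2 ** k)
```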



Theorem 2. Let $\mathcal{N}^R$ be a Bayesian network with a reduced parameter set
$(p_{i,c})_{1 \le i \le n, 1 \le c \le d_i}$ in the sense of Definition 2. Then:

$$\mathrm{Edim}(\mathcal{N}^R) \le 2 \cdot \sum_{i=1}^{n} d_i .$$



Proof. Recall that the distributions from $\mathcal{D}_{\mathcal{N}^R}$ can be written in the form (2).
We make use of the following obvious equation:

$$\log\frac{P(x)}{Q(x)} = \sum_{i=1}^{n} \sum_{c=1}^{d_i} \left[ x_i R_{i,c}(x) \log\frac{p_{i,c}}{q_{i,c}} + (1 - x_i) R_{i,c}(x) \log\frac{1 - p_{i,c}}{1 - q_{i,c}} \right] . \qquad (4)$$

A linear arrangement of the appropriate dimension is now obtained in the obvious
fashion by introducing two coordinates per pair (i,c): if x is mapped to $v_x$ in
this arrangement, then the projection of $v_x$ to the two coordinates corresponding
to (i,c) is $(R_{i,c}(x), x_i R_{i,c}(x))$; the appropriate mapping $(P,Q) \mapsto u_{P,Q}$ in this
arrangement is easily derived from (4). □
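
The arrangement described in this proof can be written out explicitly. The following Python sketch is one possible reading, not the authors' code: nodes and classes are 0-indexed, each $R_i$ is given as a dictionary from parent bit-patterns to classes, and all names and numeric values are hypothetical. The weight vector is read off from (4) by regrouping each summand as $R_{i,c}(x) \cdot a + x_i R_{i,c}(x) \cdot (b - a)$ with $a = \log\frac{1-p_{i,c}}{1-q_{i,c}}$ and $b = \log\frac{p_{i,c}}{q_{i,c}}$.

```python
import math
from itertools import product


def active_class(x, parents_i, R_i):
    """R_i applied to the projection of x onto the parents of node i."""
    return R_i[tuple(x[j] for j in parents_i)]


def log_prob(x, parents, R, params):
    """log P(x) according to (2)."""
    total = 0.0
    for i, pa in enumerate(parents):
        c = active_class(x, pa, R[i])
        total += math.log(params[i][c] if x[i] == 1 else 1.0 - params[i][c])
    return total


def feature_vector(x, parents, R, d):
    """v_x: two coordinates (R_{i,c}(x), x_i R_{i,c}(x)) per pair (i, c)."""
    v = []
    for i, pa in enumerate(parents):
        active = active_class(x, pa, R[i])
        for c in range(d[i]):
            ind = 1.0 if c == active else 0.0
            v.extend([ind, x[i] * ind])
    return v


def weight_vector(p, q, d):
    """u_{P,Q}: coefficients obtained by regrouping the summands of (4)."""
    u = []
    for i in range(len(d)):
        for c in range(d[i]):
            a = math.log((1.0 - p[i][c]) / (1.0 - q[i][c]))
            b = math.log(p[i][c] / q[i][c])
            u.extend([a, b - a])
    return u


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical reduced network on three nodes.
PARENTS = [(), (0,), (0, 1)]
R = [{(): 0}, {(0,): 0, (1,): 1}, {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}]
D = [1, 2, 2]
P = [[0.3], [0.6, 0.2], [0.1, 0.8]]
Q = [[0.5], [0.4, 0.7], [0.6, 0.3]]

if __name__ == "__main__":
    u = weight_vector(P, Q, D)
    for x in product((0, 1), repeat=3):
        direct = log_prob(x, PARENTS, R, P) - log_prob(x, PARENTS, R, Q)
        print(x, round(direct, 6), round(inner(u, feature_vector(x, PARENTS, R, D)), 6))
```

In this toy instance the arrangement has dimension $2(d_1 + d_2 + d_3) = 10$, matching the bound of Theorem 2.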



Theorem 3. Let $\mathcal{N}_\sigma$ denote the logistic autoregressive Bayesian network from
Definition 3. Then, $\mathrm{VCdim}(\mathcal{N}_\sigma) = O(n^6)$.

The proof of this theorem is found in the full paper. 

Remark 1. The linear arrangements for unconstrained Bayesian networks or for 
Bayesian networks with a reduced parameter set were easy to find. This is no 
accident: a similar remark is valid for every class of distributions (or densities) 
from the exponential family because (as pointed out in [8] for example) the 
corresponding Bayes-rule takes the form of a so-called generalized linear rule 
from which a linear arrangement is evident.^5 See the full paper for more details.

4 Lower Bounds on the Dimension of an Arrangement 

In this section, we derive lower bounds on $\mathrm{Edim}(\mathcal{N})$ that match the upper bounds
from Section 3 up to a small gap. Before we move on to the main results in
Subsections 4.1 and 4.2, we briefly mention (without proof) some specific Bayesian
networks where upper and lower bound match. The proofs are found in the full
paper.

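Theorem 4. $\mathrm{VCdim}(\mathcal{N}_0) = \mathrm{Edim}(\mathcal{N}_0) = n + 1$ if $\mathcal{N}_0$ has $n \ge 2$ nodes and
$\mathrm{VCdim}(\mathcal{N}_0) = \mathrm{Edim}(\mathcal{N}_0) = 1$ if $\mathcal{N}_0$ has 1 node.

Theorem 5. For $k \ge 0$, let $\mathcal{N}'_k$ denote the unconstrained network with $P_i = \{1, \ldots, k\}$ for $i = k+1, \ldots, n$ and $P_i = \emptyset$ for $i = 1, \ldots, k$. Then, $\mathrm{VCdim}(\mathcal{N}'_k) = \mathrm{Edim}(\mathcal{N}'_k) = 2^k(n - k + 1)$.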



4.1 Lower Bounds Based on VC-Dimension Considerations 

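Since $\mathrm{VCdim}(\mathcal{C}) \le \mathrm{VCdim}(\mathcal{C}')$ if $\mathcal{C} \le \mathcal{C}'$, a lower bound on $\mathrm{VCdim}(\mathcal{C}')$ can be
obtained from classes $\mathcal{C} \le \mathcal{C}'$ whose VC-dimension is known or easy to determine.
We first define concept classes that will fit this purpose.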

Definition 7. Let $\mathcal{N}$ be a Bayesian network. For every $i \in \{1, \ldots, n\}$, let $\mathcal{F}_i$ be
a family of ±1-valued functions on the domain $\{0,1\}^{m_i}$ and $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_n$.

^5 The bound given in Theorem 1 is slightly stronger than the bound obtained from
the general approach for members of the exponential family. 
We define $\mathcal{C}_{\mathcal{N},\mathcal{F}}$ as the concept class over domain $\{0,1\}^n \setminus \{0\}$ consisting of all
functions of the form

$$L_{\mathcal{N},f} := [(x_n, f_n), \ldots, (x_1, f_1)],$$

where $f = (f_1, \ldots, f_n) \in \mathcal{F}$. The right-hand side of this equation is understood
as a decision list, where $L_{\mathcal{N},f}(x)$ for $x \ne 0$ is determined as follows:

1. Find the largest i such that $x_i = 1$.

2. Apply $f_i$ to the projection of x to the parent-variables of $x_i$ and output the
result.



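Lemma 3. $\mathrm{VCdim}(\mathcal{C}_{\mathcal{N},\mathcal{F}}) = \sum_{i=1}^{n} \mathrm{VCdim}(\mathcal{F}_i)$.

Proof. We prove that $\mathrm{VCdim}(\mathcal{C}_{\mathcal{N},\mathcal{F}}) \ge \sum_{i=1}^{n} \mathrm{VCdim}(\mathcal{F}_i)$. (The proof for the
other direction is similar.) For every i, we embed the vectors from $\{0,1\}^{m_i}$
into $\{0,1\}^n$ according to $\tau_i(\alpha) := (\alpha', 1, 0, \ldots, 0)$, where $\alpha' \in \{0,1\}^{i-1}$ is chosen
such that its projection to the parent-variables of $x_i$ coincides with $\alpha$ and the
remaining components are projected to 0. Note that $\tau_i(\alpha)$ is absorbed in item
$(x_i, f_i)$ of the decision list $L_{\mathcal{N},f}$. It is easy to see that the following holds. If, for
$i = 1, \ldots, n$, $S_i$ is a set that is shattered by $\mathcal{F}_i$, then $\bigcup_{i=1}^{n} \tau_i(S_i)$ is shattered by
$\mathcal{C}_{\mathcal{N},\mathcal{F}}$. Thus, $\mathrm{VCdim}(\mathcal{C}_{\mathcal{N},\mathcal{F}}) \ge \sum_{i=1}^{n} \mathrm{VCdim}(\mathcal{F}_i)$. □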

The first application of Lemma 3 concerns unconstrained networks. 

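Theorem 6. Let $\mathcal{N}$ be an unconstrained Bayesian network and let $\mathcal{F}^*_i$ denote
the set of all ±1-valued functions on domain $\{0,1\}^{m_i}$ and $\mathcal{F}^* = \mathcal{F}^*_1 \times \cdots \times \mathcal{F}^*_n$.
Then, $\mathcal{C}_{\mathcal{N},\mathcal{F}^*} \le \mathcal{C}_\mathcal{N}$.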



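Proof. We have to show that, for every $f = (f_1, \ldots, f_n)$, we find a pair
$(P,Q)$ of distributions from $\mathcal{D}_\mathcal{N}$ such that, for every $x \in \{0,1\}^n$, $L_{\mathcal{N},f}(x) = \mathrm{sign}(\log(P(x)/Q(x)))$. To this end, we define the parameters for the
distributions P and Q as follows:

$$p_{i,\alpha} = \begin{cases} 2^{-2^{i-1} n}/2 & \text{if } f_i(\alpha) = -1 \\ 1/2 & \text{if } f_i(\alpha) = +1 \end{cases} \qquad\text{and}\qquad q_{i,\alpha} = \begin{cases} 1/2 & \text{if } f_i(\alpha) = -1 \\ 2^{-2^{i-1} n}/2 & \text{if } f_i(\alpha) = +1 \end{cases}$$

An easy calculation now shows that

$$\log\frac{p_{i,\alpha}}{q_{i,\alpha}} = f_i(\alpha)\, 2^{i-1} n \qquad\text{and}\qquad \left|\log\frac{1 - p_{i,\alpha}}{1 - q_{i,\alpha}}\right| < 1 . \qquad (5)$$

Fix an arbitrary $x \in \{0,1\}^n \setminus \{0\}$. Choose $i^*$ maximal such that $x_{i^*} = 1$ and let
$\alpha^*$ denote the projection of x to the parent-variables of $x_{i^*}$. Then, $L_{\mathcal{N},f}(x) = f_{i^*}(\alpha^*)$. Thus, $L_{\mathcal{N},f}(x) = \mathrm{sign}(\log(P(x)/Q(x)))$ would follow immediately from

$$\mathrm{sign}\left(\log\frac{P(x)}{Q(x)}\right) = \mathrm{sign}\left(\log\frac{p_{i^*,\alpha^*}}{q_{i^*,\alpha^*}}\right) = f_{i^*}(\alpha^*) . \qquad (6)$$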
The second equation in (6) is evident from (5). As for the first equation in (6),
we argue as follows. By the choice of $i^*$, $x_i = 0$ for every $i > i^*$. In combination
with (3) and (5), we get

$$\log\frac{P(x)}{Q(x)} = \log\frac{p_{i^*,\alpha^*}}{q_{i^*,\alpha^*}} + \sum_{i=1}^{i^*-1} \sum_{\alpha \in \{0,1\}^{m_i}} x_i M_{i,\alpha}(x) \log\frac{p_{i,\alpha}}{q_{i,\alpha}} + \sum_{i \in I} \sum_{\alpha \in \{0,1\}^{m_i}} (1 - x_i) M_{i,\alpha}(x) \log\frac{1 - p_{i,\alpha}}{1 - q_{i,\alpha}},$$

where $I = \{1, \ldots, n\} \setminus \{i^*\}$. The sign of the right-hand side of this equation is
determined by $\log(p_{i^*,\alpha^*}/q_{i^*,\alpha^*})$ since this term is of absolute value $2^{i^*-1} n$ and
$2^{i^*-1} n - \sum_{i=1}^{i^*-1} 2^{i-1} n - (n - 1) \ge 1$. This concludes the proof. □
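
The construction in this proof can be tested exhaustively on a small network. The Python sketch below is an illustration under explicit assumptions: logarithms are taken to base 2 (which makes the "easy calculation" in (5) exact), nodes are 0-indexed so that the exponent $2^{i-1} n$ becomes `2**i * n`, and all function names are hypothetical. It builds the parameters of P and Q from a given f and compares the sign of the log-ratio with the decision list on every nonzero x.

```python
import math
from itertools import product


def decision_list(x, parents, f):
    """L_{N,f}(x): apply f_i to the parent pattern of the largest i with x_i = 1."""
    i = max(j for j, bit in enumerate(x) if bit == 1)
    return f[i][tuple(x[j] for j in parents[i])]


def build_parameters(parents, f):
    """Parameters of P and Q chosen as in the proof (0-indexed nodes)."""
    n = len(parents)
    p, q = [], []
    for i, pa in enumerate(parents):
        tiny = 2.0 ** (-(2 ** i) * n) / 2.0
        p_i, q_i = {}, {}
        for alpha in product((0, 1), repeat=len(pa)):
            if f[i][alpha] == 1:
                p_i[alpha], q_i[alpha] = 0.5, tiny
            else:
                p_i[alpha], q_i[alpha] = tiny, 0.5
        p.append(p_i)
        q.append(q_i)
    return p, q


def log_ratio(x, parents, p, q):
    """log2(P(x)/Q(x)) via (3); only the active bit-pattern of each node contributes."""
    total = 0.0
    for i, pa in enumerate(parents):
        alpha = tuple(x[j] for j in pa)
        if x[i] == 1:
            total += math.log2(p[i][alpha] / q[i][alpha])
        else:
            total += math.log2((1.0 - p[i][alpha]) / (1.0 - q[i][alpha]))
    return total


def construction_agrees(parents, f):
    """True if sign(log P/Q) equals the decision list on every nonzero x."""
    p, q = build_parameters(parents, f)
    for x in product((0, 1), repeat=len(parents)):
        if not any(x):
            continue  # the decision list is only defined for x != 0
        predicted = 1 if log_ratio(x, parents, p, q) >= 0 else -1
        if predicted != decision_list(x, parents, f):
            return False
    return True


if __name__ == "__main__":
    parents = [(), (0,), (0, 1)]
    f = [{(): 1}, {(0,): -1, (1,): 1},
         {(0, 0): 1, (0, 1): -1, (1, 0): -1, (1, 1): 1}]
    print(construction_agrees(parents, f))
```

The rapidly shrinking parameters $2^{-2^{i-1} n}/2$ limit this floating-point check to small n; for larger networks exact rational arithmetic would be a safer choice.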

The next two results are straightforward applications of Lemma 3 combined with 
Theorems 6, 1, and Corollary 2. 

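Corollary 3. For every unconstrained Bayesian network $\mathcal{N}$, the following holds:

$$\sum_{i=1}^{n} 2^{m_i} \le \mathrm{VCdim}(\mathcal{N}) \le \mathrm{Edim}(\mathcal{N}) \le \left|\bigcup_{i=1}^{n} 2^{P_i \cup \{i\}}\right| \le 2 \cdot \sum_{i=1}^{n} 2^{m_i} .$$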

Corollary 4. Let $\mathcal{N}_k$ denote the Bayesian network from Example 1. Then:

$$(n - k + 1)\,2^k - 1 \le \mathrm{VCdim}(\mathcal{N}_k) \le \mathrm{Edim}(\mathcal{N}_k) \le (n - k + 1)\,2^k .$$

We now show that Lemma 3 can be applied in a similar fashion to the more 
general case of networks with a reduced parameter collection. 

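Theorem 7. Let $\mathcal{N}^R$ be a Bayesian network with a reduced parameter collection
$(p_{i,c})_{1 \le i \le n, 1 \le c \le d_i}$ in the sense of Definition 2. Let $\mathcal{F}_i^{R_i}$ denote the set of all
±1-valued functions on domain $\{0,1\}^{m_i}$ that depend on $\alpha \in \{0,1\}^{m_i}$ only through
$R_i(\alpha)$. In other words, $f \in \mathcal{F}_i^{R_i}$ iff there exists a ±1-valued function g on domain
$\{1, \ldots, d_i\}$ such that $f(\alpha) = g(R_i(\alpha))$ for every $\alpha \in \{0,1\}^{m_i}$. Finally, let $\mathcal{F}^R = \mathcal{F}_1^{R_1} \times \cdots \times \mathcal{F}_n^{R_n}$. Then, $\mathcal{C}_{\mathcal{N}^R,\mathcal{F}^R} \le \mathcal{C}_{\mathcal{N}^R}$.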

Proof. We focus on the differences to the proof of Theorem 6. First, the decision 
list Lj,j-rj uses a function / = (/i,...,/„) of the form fi{x) = gi{Ri{x)) for 
some function gi : {l,...,di} — >■ {— 1,-|-1}. Second, the distributions P, Q that 
satisfy Lj,/j{x) = sign(log(P(a;)/(5(a:))) for every x G {0, 1}” have to be defined 
over the reduced parameter collection. Compare with (4). An appropriate choice 
is as follows: 



1 9-2* 
Pi,c = \ \ 



if g^{c) = -1 



and qi^c = 



1 

2 

l0-2* 

2^ 



if ffi(c) = -1 
if g^{c) = -kl 



2 ifff*(c) = -kl 

The rest of the proof is completely analogous to the proof of Theorem 6. □ 
From Lemma 3 and Theorems 7 and 2, we get

Corollary 5. Let $\mathcal{N}^R$ be a Bayesian network with a reduced parameter collection
$(p_{i,c})_{1 \le i \le n, 1 \le c \le d_i}$ in the sense of Definition 2. Then:

$$\sum_{i=1}^{n} d_i \le \mathrm{VCdim}(\mathcal{N}^R) \le \mathrm{Edim}(\mathcal{N}^R) \le 2 \cdot \sum_{i=1}^{n} d_i .$$

Lemma 3 does not seem to apply to constrained networks. However, some of 
these networks allow for a similar reasoning as in the proof of Theorem 6. More 
precisely, the following holds: 

Theorem 8. Let $\mathcal{N}$ be a constrained Bayesian network. Assume there exists,
for every $i \in \{1,\ldots,n\}$, a collection of pairwise different bit-patterns
$\alpha_{i,1},\ldots,\alpha_{i,d_i} \in \{0,1\}^{m_i}$ such that the constraints of $\mathcal{N}$ allow for the following
independent decisions: for every pair $(i,c)$, where $i$ ranges from 1 to $n$ and
$c$ from 1 to $d_i$, parameter $p_{i,\alpha_{i,c}}$ is set either to value $2^{-2^{i-1}n}/2$ or to value $1/2$.
Then:

$$\mathrm{VCdim}(\mathcal{N}) \ge \sum_{i=1}^{n} d_i .$$

Proof. For every pair $(i,c)$, let $x_{i,c} \in \{0,1\}^n$ be the vector that has bit 1 in
coordinate $i$, bit-pattern $\alpha_{i,c}$ in the coordinates corresponding to the parents
of $i$, and zeros in the remaining coordinates (including positions $i+1,\ldots,n$).
Following the train of thought in the proof of Theorem 6, it is easy to see that
the vectors $x_{i,c}$ are shattered by $\mathcal{C}_{\mathcal{N}}$. □



Corollary 6. Let $\mathcal{N}_\sigma$ denote the logistic autoregressive Bayesian network from
Definition 3. Then:

$$\mathrm{VCdim}(\mathcal{N}_\sigma) \ge \frac{1}{2}(n-1)n .$$

Proof. We aim at applying Theorem 8 with $d_i = i-1$ for $i = 1,\ldots,n$. For
$c = 1,\ldots,i-1$, let $\alpha_{i,c} \in \{0,1\}^{i-1}$ be the pattern with bit 1 in position $c$ and
zeros elsewhere. It follows now from Definition 3 that $p_{i,\alpha_{i,c}} = \sigma(w_{i,c})$. Since
$\sigma(\mathbb{R}) = \,]0,1[$, the parameters $p_{i,\alpha_{i,c}}$ can independently be set to any value of our
choice in $]0,1[$. Thus, Theorem 8 applies. □



4.2 Lower Bounds Based on Spectral Norm Considerations 

We would like to show an exponential lower bound on $\mathrm{Edim}(\mathcal{N}_\sigma)$. However, at
the time being, we get such a bound for a slight modification of this network
only:

Definition 8. The modified logistic autoregressive Bayesian network $\mathcal{N}'_\sigma$ is the
fully connected Bayesian network with nodes $0, 1, \ldots, n+1$ and the following
constraints on the parameter collection:

$$\forall i = 0,\ldots,n,\ \exists (w_{i,j})_{0 \le j \le i-1} \in \mathbb{R}^i,\ \forall \alpha \in \{0,1\}^i : \quad p_{i,\alpha} = \sigma\left(\sum_{j=0}^{i-1} w_{i,j}\alpha_j\right)$$

and

$$\exists (w_i)_{0 \le i \le n},\ \forall \alpha \in \{0,1\}^{n+1} : \quad p_{n+1,\alpha} = \sigma\left(\sum_{i=0}^{n} w_i\,\sigma\left(\sum_{j=0}^{i-1} w_{i,j}\alpha_j\right)\right) .$$

Obviously, $\mathcal{N}'_\sigma$ is completely described by the parameter collections
$(w_{i,j})_{0 \le i \le n, 0 \le j \le i-1}$ and $(w_i)_{0 \le i \le n}$.

The crucial difference between $\mathcal{N}'_\sigma$ and $\mathcal{N}_\sigma$ is the node $n+1$ whose sigmoidal
function gets the outputs of the other sigmoidal functions as input. Roughly
speaking, $\mathcal{N}_\sigma$ is a “one-layer” network whereas $\mathcal{N}'_\sigma$ has an extra node at a “second
layer”.

Theorem 9. Let $\mathcal{N}'_\sigma$ denote the modified logistic autoregressive Bayesian network
with $n+2$ nodes, where we assume (for sake of simplicity only) that $n$ is a
multiple of 4. Then, $\mathrm{PARITY}_{n/2} \le \mathcal{N}'_\sigma$ even if we restrict the “weights” in the
parameter collection of $\mathcal{N}'_\sigma$ to integers of size $O(\log n)$.

Proof. The mapping

$$\{0,1\}^{n/2} \ni x = (x_1,\ldots,x_{n/2}) \mapsto (\underbrace{1, x_1, \ldots, x_{n/2}, 1, \ldots, 1}_{\alpha}, 1) = x' \in \{0,1\}^{n+2} \tag{7}$$

embeds $\{0,1\}^{n/2}$ into $\{0,1\}^{n+2}$. Note that $\alpha$, as indicated in (7), equals the bit-pattern
of the parent-variables of $x'_{n+1}$ (which are actually all other variables).
We claim that the following holds. For every $a \in \{0,1\}^{n/2}$, there exists a pair
$(P,Q)$ of distributions from $\mathcal{N}'_\sigma$ such that, for every $x \in \{0,1\}^{n/2}$,

$$(-1)^{a^\top x} = \mathrm{sign}\left(\log\frac{P(x')}{Q(x')}\right) . \tag{8}$$

(Clearly the theorem follows once the claim is settled.) The proof of the claim
makes use of the following facts:

Fact 1. For every $a \in \{0,1\}^{n/2}$, function $(-1)^{a^\top x}$ can be computed by a 2-layer
(unit weights) threshold circuit with $n/2$ threshold units at the first layer
(and, of course, one output threshold unit at the second layer).

Fact 2. Each 2-layer threshold circuit $C$ with polynomially bounded integer
weights can be simulated by a 2-layer sigmoidal circuit $C'$ with polynomially
bounded integer weights, the same number of units, and the following output
convention: $C(x) = 1 \Rightarrow C'(x) \ge 2/3$ and $C(x) = 0 \Rightarrow C'(x) \le 1/3$.
The same remark holds when we replace “polynomially bounded” by
“logarithmically bounded”.

Fact 3. $\mathcal{N}'_\sigma$ contains (as a “substructure”) a 2-layer sigmoidal circuit $C'$ with
$n/2$ input nodes, $n/2$ sigmoidal units at the first layer, and one sigmoidal
unit at the second layer.
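To make Fact 1 concrete, here is a sketch of one standard way to compute parity of the selected bits with a 2-layer threshold circuit. It is an illustration only and is not claimed to be the construction used in [16]; the function name is hypothetical. The first-layer unit k fires iff at least k selected input bits are set, and the output unit applies alternating ±1 weights with threshold 1.

```python
from itertools import product


def parity_threshold_circuit(a, x):
    """Return 1 if a.x is odd and 0 if it is even, via a 2-layer threshold circuit."""
    m = len(a)
    s = sum(ai * xi for ai, xi in zip(a, x))  # input sum with 0/1 weights a_i
    # First layer: m threshold units, unit k fires iff s >= k.
    first = [1 if s >= k else 0 for k in range(1, m + 1)]
    # Second layer: weights +1, -1, +1, ... and threshold 1.
    # If exactly r units fire, the alternating sum is 1 for odd r and 0 for even r.
    out = sum((1 if k % 2 == 1 else -1) * t for k, t in enumerate(first, start=1))
    return 1 if out >= 1 else 0


def signed_parity(a, x):
    """The function (-1)^(a.x) obtained from the circuit output."""
    return 1 - 2 * parity_threshold_circuit(a, x)
```

With m = n/2 inputs this uses n/2 first-layer units and one output unit, matching the unit counts stated in Fact 1.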

Fact 1 (even its generalization to arbitrary symmetric Boolean functions) is well
known [16]. Fact 2 follows from a more general result by Maass, Schnitger, and
Sontag. (See Theorem 4.3 in [22].) The third fact needs some explanation. (The
following discussion should be compared with Definition 8.) We would like the
term $p_{n+1,\alpha}$ to satisfy $p_{n+1,\alpha} = C'(\alpha_1,\ldots,\alpha_{n/2})$, where $C'$ denotes an arbitrary
2-layer sigmoidal circuit as described in Fact 3. To this end, we set $w_{i,j} = 0$ if
$1 \le i \le n/2$ or if $i,j \ge n/2+1$. We set $w_i = 0$ if $1 \le i \le n/2$. The parameters
which have been set to zero are referred to as redundant parameters in what
follows. Recall from (7) that $\alpha_0 = \alpha_{n/2+1} = \cdots = \alpha_n = 1$. From these settings
(and from $\sigma(0) = 1/2$), we get

$$p_{n+1,\alpha} = \sigma\left(\frac{1}{2}w_0 + \sum_{i=n/2+1}^{n} w_i\,\sigma\left(w_{i,0} + \sum_{j=1}^{n/2} w_{i,j}\alpha_j\right)\right) .$$

This is the output of a 2-layer sigmoidal circuit $C'$ on input $(\alpha_1,\ldots,\alpha_{n/2})$,
indeed.
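The reduction to a 2-layer circuit can be checked numerically. The sketch below assumes the constraint of Definition 8 reads $p_{n+1,\alpha} = \sigma(\sum_{i=0}^{n} w_i\,\sigma(\sum_{j=0}^{i-1} w_{i,j}\alpha_j))$, zeroes the redundant parameters as described above, and compares the result with the reduced 2-layer expression. All helper names and the random weight range are illustrative choices.

```python
import itertools
import math
import random


def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_weights(n, rng):
    """Random weights with the redundant parameters set to zero."""
    h = n // 2
    W = [[rng.uniform(-2, 2) for _ in range(i)] for i in range(n + 1)]
    w = [rng.uniform(-2, 2) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(i):
            if i <= h or j >= h + 1:  # 1 <= i <= n/2, or i, j >= n/2 + 1
                W[i][j] = 0.0
    for i in range(1, h + 1):
        w[i] = 0.0
    return w, W


def embed(x, n):
    """alpha = (1, x_1, ..., x_{n/2}, 1, ..., 1) with indices 0..n, as in (7)."""
    return (1,) + tuple(x) + (1,) * (n - n // 2)


def p_general(w, W, alpha):
    n = len(w) - 1
    return sigma(sum(w[i] * sigma(sum(W[i][j] * alpha[j] for j in range(i)))
                     for i in range(n + 1)))


def p_reduced(w, W, alpha):
    n = len(w) - 1
    h = n // 2
    hidden = (w[i] * sigma(W[i][0] + sum(W[i][j] * alpha[j] for j in range(1, h + 1)))
              for i in range(h + 1, n + 1))
    return sigma(0.5 * w[0] + sum(hidden))
```

For example, with `n = 4` the two expressions agree on all four inputs x in {0,1}^2 up to floating-point error.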

We are now in the position to describe the choice of distributions $P$ and $Q$. Let
$C'$ be the sigmoidal circuit that computes $(-1)^{a^\top x}$ for some fixed $a \in \{0,1\}^{n/2}$
according to Facts 1 and 2. Let $P$ be the distribution obtained by setting the
redundant parameters to zero (as described above) and the remaining parameters
as in $C'$. Thus, $p_{n+1,\alpha} = C'(\alpha_1,\ldots,\alpha_{n/2})$. Let $Q$ be the distribution with the
same parameters as $P$ except for replacing $w_i$ by $-w_i$. Thus, by symmetry of
$\sigma$, $q_{n+1,\alpha} = 1 - C'(\alpha_1,\ldots,\alpha_{n/2})$. Since $x'_{n+1} = 1$ and since all but one factor in
$P(x')/Q(x')$ cancel each other, we arrive at

$$\frac{P(x')}{Q(x')} = \frac{p_{n+1,\alpha}}{q_{n+1,\alpha}} = \frac{C'(\alpha_1,\ldots,\alpha_{n/2})}{1 - C'(\alpha_1,\ldots,\alpha_{n/2})} .$$

Since $C'$ computes $(-1)^{a^\top x}$ (with the output convention from Fact 2), we get
$P(x')/Q(x') \ge 2$ if $(-1)^{a^\top x} = 1$, and $P(x')/Q(x') \le 1/2$ otherwise, which
implies (8) and concludes the proof of the claim. □

From Corollary 1 and Theorem 9, we get

Corollary 7. $\mathrm{Edim}(\mathcal{N}'_\sigma) \ge 2^{n/4}$.

We mentioned in the introduction (see the remarks about random projections)
that a large lower bound on $\mathrm{Edim}(\mathcal{C})$ rules out the possibility of a large margin
classifier. For the class $\mathrm{PARITY}_n$, this can be made more precise. It was shown
in [10,13] that every linear arrangement for $\mathrm{PARITY}_n$ has an average geometric
margin of at most $2^{-n/2}$. Thus there can be no linear arrangement with an
average margin exceeding $2^{-n/4}$ for $\mathcal{C}_{\mathcal{N}'_\sigma}$ even if we restrict the weight parameters
in $\mathcal{N}'_\sigma$ to logarithmically bounded integers.
Open Problems 1) Determine Edim for the (unmodified) logistic autoregressive
Bayesian network.

2) Determine Edim for other popular classes of distributions or densities (where, 
in the light of Remark 1, those from the exponential family look like a good 
thing to start with). 



Acknowledgements. Thanks to the anonymous referees for valuable comments 
and suggestions. 
Abstract. We prove that given a nearly log-concave density, in any
partition of the space into two well separated sets, the measure of the points
that do not belong to these sets is large. We apply this isoperimetric
inequality to derive lower bounds on the generalization error in learning.
We also show that when the data are sampled from a nearly log-concave
distribution, the margin cannot be large in a strong probabilistic sense.
We further consider regression problems and show that if the inputs and
outputs are sampled from a nearly log-concave distribution, the measure
of points for which the prediction is wrong by more than $\epsilon_0$ and less than
$\epsilon_1$ is (roughly) linear in $\epsilon_1 - \epsilon_0$.



1 Introduction 

Large margin classifiers (e.g., [CS00,SBSS00] to name but a few recent books) 
have become an almost ubiquitous approach in supervised machine learning. The 
plethora of algorithms that maximize the margin, and their impressive success 
(e.g., [SS02] and references therein) may lead one to believe that obtaining a 
large margin is synonymous with successful generalization and classification. In 
this paper we directly consider the question of how much weight the margin 
must carry. We show that essentially if the margin between two classes is large, 
then the weight of the “no-man’s land” between the two classes must be large 
as well. Our probabilistic assumption is that the data are sampled from a nearly 
log-concave distribution. Under this assumption, we prove that for any partition 
of the space into two sets such that the distance between those two sets is t, the 
measure of the “no man’s land” outside the two sets is lower bounded by t times 
the minimum of the measure of the two sets times a dimension-free constant. The
direct implication of this result is that a large margin is unlikely when sampling 
data from such a distribution. 

Our modelling assumption is that the underlying distribution has a $\beta$-log-concave
density. While this assumption may appear restrictive, we note that
many “reasonable” functions belong to this family. We discuss this assumption
in Section 2, and point out some interesting properties of $\beta$-log-concave functions.

* C. Caramanis is eligible for the Best student paper award. 
In Section 3 we prove an inequality stating that the measure (under a
$\beta$-log-concave distribution) of the “no-man’s land” is large if the sets are well
separated. This result relies essentially on the Prékopa-Leindler inequality which
is a generalization of the Brunn-Minkowski inequality (we refer the reader to
the excellent survey [Gar02]). We note that Theorem 2 was stated in [LS90]
for volumes, and in [AK91] for $\beta$-log-concave distributions, in the context of
efficient sampling from convex bodies. However, there are steps in the proof
which we were unable to follow. Specifically, the reduction in [AK91] to what
they call the “needle-like” case is based on an argument used in [LS90], which
uses the Ham-Sandwich Theorem to guarantee not only bisection, but also some
orthogonality properties of the bisecting hyperplane. It is not clear to us how
one may obtain such guarantees from the Ham-Sandwich Theorem. Furthermore,
the solution of the needle-like case in [AK91] relies on a uniformity assumption
on the modulation of the distribution, which does not appear evident from the
assumptions on the distribution. We provide a complete proof of the result using
the Ham-Sandwich Theorem (as in [LS90]) and a different reduction argument.
We further point out a few natural extensions.

In Section 4 we specialize the isoperimetric inequality to two different se- 
tups. First, we provide lower bounds for the generalization error in classification 
under the assumption that the classifier will be tested using a $\beta$-log-concave
distribution, which did not necessarily generate the data. While this assumption is
not in line with the standard PAC learning formulation, it is applicable to the
setup where data are sampled from one distribution and performance is judged 
by another. Suppose, for instance, that the generating distribution evolves over 
time, while the true classifier remains fixed. We may have access to a training 
set generated by a distribution quite different from the one we use to test our 
classifier. We show that if there is a large (in a geometric sense) family of clas- 
sifiers that agree with the training points, then for any choice of classifier there 
exists another classifier compared to which the generalization error is relatively 
large. Second, we consider the typical statistical machine learning setup, and 
show that for any classifier the probability of a large margin (with respect to 
that classifier) decreases exponentially fast to 0 with the number of samples, if 
the data are sampled from a $\beta$-log-concave distribution. It is important to note
that the $\beta$-log-concave assumption applies to the input space. If we use a Mercer
kernel, the induced distribution in the feature space may not be $\beta$-log-concave.
If the kernel map is Lipschitz continuous with constant L, then we can relate the 
“functional” margin in the feature space to the “geometric” margin in the input 
space, and our results carry over directly. If the kernel map is not Lipschitz, then 
our results do not directly apply. 

In Section 5 we briefly touch on the issue of regression. We show that if we
have a regressor, then the measure of a tube around its prediction with inner
radius $\epsilon_0$ and outer radius $\epsilon_1$ is bounded from below by $\epsilon_1 - \epsilon_0$ times a constant
(as long as $\epsilon_1$ is not too large). The direct implication of this inequality is that
the margins of the tube carry a significant portion of the measure.
Some recent results [BES02,Men04] argue that the success of large margin
classifiers is remarkable since most classes cannot have a useful embedding in
some Hilbert space. Our results provide a different angle, as we show that having
a large margin is unlikely to start with. Moreover, if there happens to be a large
margin, it may well result in a large error (which is proportional to the margin).
A notable feature of our bounds is that they are dimension-free and are therefore
immune to the curse of dimensionality (this is essentially due to the $\beta$-log-concave
assumption). We note the different flavor of our results from the “classical” lower
bounds (e.g., [AB99,Vap98]) that are mostly concerned with the PAC setup and 
where the sample complexity is the main object of interest. We do not address 
the sample complexity directly in this work. 



2 Nearly Log-Concave Functions 

We assume throughout the paper that generalization error is measured using a 
nearly log-concave distribution. In this section we define such distributions and 
highlight some of their properties. While we are mostly interested in distributions,
it is useful to write the following definitions in terms of a general function
on $\mathbb{R}^n$.

Definition 1. A function $f : \mathbb{R}^n \to \mathbb{R}$ is $\beta$-log-concave for some $\beta \ge 0$ if for
any $\lambda \in (0,1)$, $x_1 \in \mathbb{R}^n$, $x_2 \in \mathbb{R}^n$, we have that:

$$f(\lambda x_1 + (1-\lambda)x_2) \ge e^{-\beta} f(x_1)^{\lambda} f(x_2)^{1-\lambda} . \tag{2.1}$$

A function $f$ is log-concave if it is 0-log-concave.
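As an illustration of Definition 1, the following sketch estimates, on a finite grid, the smallest $\beta$ for which inequality (2.1) holds for a one-dimensional density. Taking logarithms in (2.1) gives $\beta \ge \lambda \log f(x_1) + (1-\lambda)\log f(x_2) - \log f(\lambda x_1 + (1-\lambda)x_2)$, so the largest such gap over the grid is a lower estimate of the true $\beta$. The grid, the two example densities, and the function names are assumptions made for this example only.

```python
import math


def beta_gap(f, xs, lambdas):
    """Largest violation of log-concavity of f over the grid (0.0 if none)."""
    worst = 0.0
    for x1 in xs:
        for x2 in xs:
            for lam in lambdas:
                mid = lam * x1 + (1 - lam) * x2
                gap = (lam * math.log(f(x1)) + (1 - lam) * math.log(f(x2))
                       - math.log(f(mid)))
                worst = max(worst, gap)
    return worst


def gauss(x):
    """Standard normal density (log-concave)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def mixture(x):
    """Equal mixture of N(-2, 1) and N(2, 1): bimodal, hence not log-concave."""
    return 0.5 * gauss(x + 2) + 0.5 * gauss(x - 2)


GRID = [-4 + 0.5 * k for k in range(17)]
LAMBDAS = [0.1 * k for k in range(1, 10)]
```

On this grid the Gaussian yields a gap that is zero up to rounding, while the bimodal mixture yields a strictly positive gap, consistent with the mixture being $\beta$-log-concave only for some $\beta > 0$.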

The class of log-concave distributions itself is rather rich. For example, it
includes Gaussian, Uniform, Logistic, and Exponential distributions. We refer
the reader to [BB89] for an extensive list of such distributions, sufficient conditions
for a distribution to be log-concave, and ways to “produce” log-concave
distributions from other log-concave distributions. The class of $\beta$-log-concave
distributions is considerably richer since we allow a factor of $e^{-\beta}$ in Eq. (2.1).
For example, unlike log-concave distributions, $\beta$-log-concave distributions need
not be continuous. We now provide some results that are useful in the sequel.
We start from the following observation.

Lemma 1. The support of a $\beta$-log-concave function is a convex set. Also, $\beta$-log-concave
functions are bounded on bounded sets.

Distributions that are $\beta$-log-concave are not necessarily unimodal, but possess a
unimodal quality, in the sense of Lemma 2 below. This simple lemma captures
the properties of $\beta$-log-concavity that are central to our main results and subsequent
applications. It implies that if we have a $\beta$-log-concave distribution on an
interval, there cannot be any big “holes” or “valleys” in the mass distribution.
Thus if we divide the interval into three intervals, if the middle interval is large,
it must also carry a lot of the weight. In higher dimensions, essentially this says
that if we divide our set into two sets, if the distance between the sets is large,
the mass of the “no-man’s land” will also be large. This is essentially the content
of Theorem 2 below.

Lemma 2. Suppose that $f(x)$ is $\beta$-log-concave on an interval $[u_1,u_2]$. Let $u_1 \le
x_1 \le x_2 \le u_2$. Then for any $x \in [x_1,x_2]$, at least one of the following holds:

$$f(x) \ge f(y) \cdot e^{-\beta}, \quad \text{for all } y \in [u_1,x_1],$$

or

$$f(x) \ge f(y) \cdot e^{-\beta}, \quad \text{for all } y \in [x_2,u_2].$$

Proof. Fix $\epsilon > 0$. There is some $x^* \in [u_1,u_2]$ such that $\sup_{x \in [u_1,u_2]} f(x) \le
f(x^*) + \epsilon$. Suppose $x^* \in [u_1,x_1]$. Then for any $x \in [x_1,x_2]$ and $y \in [x_2,u_2]$, and
for some $\lambda \in (0,1)$ we have $x = \lambda x^* + (1-\lambda)y$, and by the $\beta$-log-concavity of $f$,
we have

$$f(x) \ge f(x^*)^{\lambda} f(y)^{1-\lambda} e^{-\beta} \ge (f(y) - \epsilon)^{\lambda} f(y)^{1-\lambda} e^{-\beta} . \tag{2.2}$$

Similarly, if $x^* \in [x_2,u_2]$, then for every $x \in [x_1,x_2]$ and $y \in [u_1,x_1]$, Eq. (2.2)
holds. Finally, if $x^* \in [x_1,x_2]$, then for any $x \in [x_1,x^*]$, Eq. (2.2) holds for
$y \in [u_1,x_1]$, and for $x \in [x^*,x_2]$, Eq. (2.2) holds for any $y \in [x_2,u_2]$. Take a
sequence $\epsilon_i \searrow 0$. We know that for every $\epsilon_i$ Eq. (2.2) holds for all $x \in [x_1,x_2]$
and all $y \in [u_1,x_1]$ or all $y \in [x_2,u_2]$. It follows that there exists a sequence
$\epsilon_i \searrow 0$ such that for all $x \in [x_1,x_2]$, Eq. (2.2) holds for all $y \in [u_1,x_1]$ or for
all $y \in [x_2,u_2]$. Since $\epsilon_i$ converges to 0, $f(x) \ge f(y)e^{-\beta}$ in at least one of those
domains. □

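The following inequality has many uses in geometry, statistics, and analysis
(see [Gar02]). Note that it is stated with respect to a specific $\lambda \in (0,1)$ and not
to all $\lambda$.

Theorem 1 (Prékopa-Leindler Inequality). Let $0 < \lambda < 1$, and $h, g_1, g_2$
be nonnegative integrable functions on $\mathbb{R}^n$, such that $h((1-\lambda)x + \lambda y) \ge
g_1(x)^{1-\lambda} g_2(y)^{\lambda}$, for every $x, y \in \mathbb{R}^n$. Then

$$\int_{\mathbb{R}^n} h(x)\,dx \ge \left(\int_{\mathbb{R}^n} g_1(x)\,dx\right)^{1-\lambda} \left(\int_{\mathbb{R}^n} g_2(x)\,dx\right)^{\lambda} .$$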



The following lemma plays a key part in the reduction technique we use
below. Recall that the orthogonal projection of a set $K \subseteq \mathbb{R}^{n+m}$ onto $\mathbb{R}^n$ is
defined as $\pi(K) = \{x \in \mathbb{R}^n : \exists y \in \mathbb{R}^m \text{ s.t. } (x,y) \in K\}$.

Lemma 3. Let $f(x,y)$ be a $\beta$-log-concave distribution on a convex set $K \subseteq
\mathbb{R}^{n+m}$. For every $x$ in $\pi(K)$ consider the section $K(x) = \{(x,y) \in \mathbb{R}^{n+m} :
(x,y) \in K\}$. Then the distribution $F(x) = \int_{K(x)} f(x,y)\,dy$ is $\beta$-log-concave on
$\pi(K)$.

Proof. This is a consequence of the Prékopa-Leindler inequality as in [Gar02],
Section 9, for log-concave distributions. Adapting the proof for $\beta$-log-concave
distributions is straightforward. □

There are quite a few interesting properties of $\beta$-log-concave distributions.
For example, the convolution of a $\beta_1$-log-concave and a $\beta_2$-log-concave distribution
is $(\beta_1 + \beta_2)$-log-concave; Gaussian mixtures are $\beta$-log-concave; and mixtures
of distributions with bounded Radon-Nikodym derivative are also $\beta$-log-concave.
These properties will be provided elsewhere.

3 Isoperimetric Inequalities 

In this section we prove our main result concerning $\beta$-log-concave distributions.
We show that if two sets are well separated, then the “no man’s land” between
them has large measure relative to the measure of the two sets. We first prove
the result for bounded sets and then provide two immediate corollaries. Let
$d(x,y)$ denote the Euclidean distance in $\mathbb{R}^n$. We define the distance between
two sets $K_1$ and $K_2$ as $d(K_1,K_2) = \inf_{x \in K_1, y \in K_2} d(x,y)$ and the diameter of
a set $K$ as $\mathrm{diam}(K) = \sup_{x,y \in K} d(x,y)$. Given a distribution $f$ we say that
$\mu(K) = \int_K f(x)\,dx$ is the induced measure. A decomposition of a closed set
$K \subseteq \mathbb{R}^n$ to a collection of closed sets $K_1, K_2, \ldots, K_l$ satisfies that $\bigcup_{i=1}^{l} K_i = K$
and $\nu(K_i \cap K_j) = 0$ for all $i \ne j$, where $\nu$ is the Lebesgue measure on $\mathbb{R}^n$.

Theorem 2. Let $K$ be a closed and bounded convex set with non-zero diameter
in $\mathbb{R}^n$ with a decomposition $K = K_1 \cup B \cup K_2$. For any $\beta$-log-concave distribution
$f(x)$, the induced measure $\mu$ satisfies that

$$\mu(B) \ge e^{-\beta}\,\frac{d(K_1,K_2)}{\mathrm{diam}(K)}\,\min\{\mu(K_1),\mu(K_2)\} .$$

We remark that this bound is dimension-free. The ratio $d(K_1,K_2)/\mathrm{diam}(K)$ is
necessary, as essentially it adjusts for any scaling of the problem. We further
note that the minimum $\min\{\mu(K_1),\mu(K_2)\}$ might be quite small; however, this
appears to be unavoidable (e.g., consider the tail of a Gaussian, which is
log-concave). The proof proceeds by induction on the dimension $n$, with base case
$n = 1$. To prove the inductive step, first we show that it is enough to consider
an “$\epsilon$-flat” set $K$, i.e., a set that is contained in an ellipse whose smallest axis
is smaller than some $\epsilon > 0$. Next, we show that for an $\epsilon$-flat set $K$, we can
project onto $n-1$ dimensions where the theorem holds by induction. By properly
performing the projection, we show that if the result holds for the projection, it
holds for the original set. We abbreviate $t = d(K_1,K_2)$. The theorem trivially
holds if $t = 0$, so we can assume that $t > 0$. From Lemma 1 above, we know that
the support of $f(x)$ is convex. Thus, we can assume without loss of generality,
that since $K$ is compact, $f(x)$ is strictly positive on the interior of $K$.
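A quick numerical sanity check of Theorem 2 in one dimension may help fix ideas. The sketch below assumes the standard Gaussian density (log-concave, so $\beta = 0$) restricted to an interval $K = [u_1,u_2]$, split into $K_1 = [u_1,a]$, $B = [a,b]$, $K_2 = [b,u_2]$, so that $d(K_1,K_2) = b - a$. The density is left unnormalized on $K$, which is harmless because both sides of the inequality scale the same way. Function names are illustrative.

```python
import math


def gauss_mass(a, b):
    """Mass that the standard normal density assigns to the interval [a, b]."""
    return 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))


def theorem2_slack(u1, a, b, u2, beta=0.0):
    """mu(B) minus the lower bound of Theorem 2; nonnegative if the bound holds."""
    mu_k1 = gauss_mass(u1, a)
    mu_b = gauss_mass(a, b)
    mu_k2 = gauss_mass(b, u2)
    bound = math.exp(-beta) * (b - a) / (u2 - u1) * min(mu_k1, mu_k2)
    return mu_b - bound
```

For instance, `theorem2_slack(-3, 2.0, 2.5, 3)` probes the Gaussian tail, where the minimum on the right-hand side is small, as noted in the remark above.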

Lemma 4. Theorem 2 holds for $n = 1$.

Proof. If $n = 1$, then $K$ is some interval, $K = [u_1,u_2]$, with $\mathrm{diam}(K) = |u_2 - u_1|$.
Since $t = d(K_1,K_2) > 0$, no point of $K_1$ is within a distance $t$ from any point
of $K_2$. Furthermore, there must be at least one interval $(b_1,b_2) \subseteq B$ such that
$|b_2 - b_1| \ge t$, and such that $(b_1,b_2) \cap (K_1 \cup K_2) = \emptyset$. Fix some $\epsilon > 0$, with
$\epsilon < t/2$. Define the $\epsilon$-expansion sets $\hat{K}_1 = \{x \in K : d(x,K_1) \le \epsilon\}$, and
$\hat{K}_2 = \{x \in K : d(x,K_2) \le \epsilon\}$. Define $\hat{B}$ to be the closure of the complement in
$K$ of $\hat{K}_1 \cup \hat{K}_2$. Each set is a union of a finite number of closed intervals, and thus
we have the decomposition $[u_1,u_2] = \bigcup_{i=1}^{m} [r_{i-1},r_i]$, where each interval $[r_{i-1},r_i]$
is either a $\hat{K}_1$-interval, a $\hat{K}_2$-interval, or a $\hat{B}$-interval. We modify the sets so that
if the $\hat{B}$-interval $[r_{i-1},r_i]$ is sandwiched by two $\hat{K}_i$-intervals ($i = 1,2$) then we
add that interval to $\hat{K}_i$. If the $\hat{B}$-interval is either the first interval $[r_0,r_1]$, or
the last interval, $[r_{m-1},r_m]$, then we add it to whichever set $\hat{K}_i$ is to its right,
or left, respectively.

The three resulting sets $\hat{K}_1$, $\hat{K}_2$, and $\hat{B}$ are closed, intersect at most at a
finite number of points, and thus are a decomposition of $K$. Each set is a union
of a finite number of closed intervals. Furthermore, $\hat{t} = d(\hat{K}_1,\hat{K}_2) \ge t - 2\epsilon$, and
$\hat{K}_1 \supseteq K_1$, $\hat{K}_2 \supseteq K_2$, and $\hat{B} \subseteq B$. By our modifications above, each $\hat{B}$-interval
must have length at least $\hat{t}$.

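Consider any $\hat{B}$-interval $[r_{i-1},r_i]$. Let $x^*$ be a maximizer$^1$ of $f(x)$ on $[u_1,u_2]$,
and $x_{\min}$ a minimizer of $f(x)$ on $[r_{i-1},r_i]$. Suppose that $x^* \ge x_{\min}$. Then by
Lemma 2, for any $y \le r_{i-1}$, we must have $f(x_{\min}) \ge f(y)e^{-\beta}$. Therefore,

$$e^{-\beta}\mu([u_1,r_{i-1}]) = e^{-\beta}\int_{u_1}^{r_{i-1}} f(x)\,dx \le (r_{i-1} - u_1) f(x_{\min})
\le \mathrm{diam}(K) \cdot f(x_{\min}) \le \frac{\mathrm{diam}(K)}{r_i - r_{i-1}} \int_{r_{i-1}}^{r_i} f(x)\,dx
\le \frac{\mathrm{diam}(K)}{\hat{t}}\,\mu([r_{i-1},r_i]) .$$

If instead we have $x^* < x_{\min}$, then in a similar manner we obtain the inequality

$$e^{-\beta}\mu([r_i,u_2]) \le \frac{\mathrm{diam}(K)}{\hat{t}}\,\mu([r_{i-1},r_i]) .$$

Therefore, in general, for any $\hat{B}$-interval $(r_{i-1},r_i)$,

$$\mu([r_{i-1},r_i]) \ge e^{-\beta}\,\frac{\hat{t}}{\mathrm{diam}(K)}\,\min\{\mu([u_1,r_{i-1}]),\mu([r_i,u_2])\} .$$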



Suppose, without loss of generality, that $[r_0,r_1]$ is a $\hat{K}_1$-interval. Consider
the first $\hat{B}$-interval $[r_1,r_2]$. If $\mu([r_1,r_2]) \ge e^{-\beta}(\hat{t}/\mathrm{diam}(K))\mu([r_2,u_2])$,
then $\mu(\hat{B}) \ge e^{-\beta}(\hat{t}/\mathrm{diam}(K))\mu(\hat{K}_2)$ and we are done. So let us assume
that $\mu([r_1,r_2]) \ge e^{-\beta}(\hat{t}/\mathrm{diam}(K))\mu([u_1,r_1])$. Similarly, for the
last $\hat{B}$-interval $(r_{m-2},r_{m-1})$, we can assume that $\mu([r_{m-2},r_{m-1}]) \ge
e^{-\beta}(\hat{t}/\mathrm{diam}(K))\mu([r_{m-1},u_2])$, otherwise the result immediately follows. This
implies that there must be two consecutive $\hat{B}$-intervals, say $(r_{j-1},r_j)$
and $(r_{j+1},r_{j+2})$, such that $\mu([r_{j-1},r_j]) \ge e^{-\beta}(\hat{t}/\mathrm{diam}(K))\mu([u_1,r_{j-1}])$ and
$\mu([r_{j+1},r_{j+2}]) \ge e^{-\beta}(\hat{t}/\mathrm{diam}(K))\mu([r_{j+2},u_2])$. Since $[u_1,r_{j-1}] \cup [r_{j+2},u_2]$
contains either all of $\hat{K}_1$ or $\hat{K}_2$, combining these two inequalities, and using the fact
that $\hat{K}_i \supseteq K_i$, and $\hat{B} \subseteq B$, we obtain

$$\begin{aligned}
\mu(B) \ge \mu(\hat{B}) &\ge \mu([r_{j-1},r_j] \cup [r_{j+1},r_{j+2}]) \\
&\ge e^{-\beta}\,\frac{\hat{t}}{\mathrm{diam}(K)}\,\bigl(\mu([u_1,r_{j-1}]) + \mu([r_{j+2},u_2])\bigr) \\
&\ge e^{-\beta}\,\frac{\hat{t}}{\mathrm{diam}(K)}\,\min\{\mu(\hat{K}_1),\mu(\hat{K}_2)\} \\
&\ge e^{-\beta}\,\frac{t - 2\epsilon}{\mathrm{diam}(K)}\,\min\{\mu(K_1),\mu(K_2)\} .
\end{aligned}$$

Since this holds for every $\epsilon > 0$, the result follows. □

$^1$ As in Lemma 2, $f$ may not be continuous, so we may only be able to find a point $x^*$
($x_{\min}$) that is infinitesimally close to the supremum (infimum) of $f$. For convenience
of exposition, we assume $f$ is continuous. This assumption can be removed with an
argument exactly parallel to that given in Lemma 2.

We now prove the $n$-dimensional case. The first part of our inductive step is
to show that it is enough to consider an “$\epsilon$-flat” set $K$. To make this precise, we
use the Löwner-John Ellipsoid of a set $K$. This is the minimum volume ellipsoid
$E$ containing $K$ (see, e.g. [GLS93]). This ellipsoid is unique. The key property
we use is that if we shrink $E$ from its center by a factor of $n$, then it is contained
in $K$. We define an $\epsilon$-flat set to be such that the smallest axis of its Löwner-John
Ellipsoid has length no more than $\epsilon$.



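Lemma 5. Suppose the theorem fails by $\delta$ on $K$, for some $\delta > 0$, i.e.

$$(1+\delta)\mu(B) < e^{-\beta}\,\frac{t}{\mathrm{diam}(K)}\,\min\{\mu(K_1),\mu(K_2)\} . \tag{3.3}$$

Then for any $\epsilon > 0$, there exists some $\epsilon$-flat set $\tilde{K} \subseteq K$ with decomposition
$\tilde{K} = \tilde{K}_1 \cup \tilde{B} \cup \tilde{K}_2$, such that $\tilde{K}_i \subseteq K_i$, $\tilde{B} \subseteq B$, $d(\tilde{K}_1,\tilde{K}_2) \ge t$, and $\mathrm{diam}(\tilde{K}) \le d$,
and such that the theorem fails by $\delta$, i.e., Eq. (3.3) holds for $\tilde{K}$, $\tilde{K}_1$, $\tilde{K}_2$, $\tilde{B}$.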

Proof. Let $K$, $K_1$, $K_2$, $B$ and $\delta$ be as in the statement above. Pick some $\epsilon > 0$
much smaller than $t$. Suppose that all axes of the Löwner-John ellipsoid of $K$
are greater than $\epsilon$. A powerful consequence of the Borsuk-Ulam Theorem, the
so-called Ham-Sandwich Theorem (see, e.g., [Mat02]) says that in $\mathbb{R}^n$, given $n$
Borel measures $\mu_k$, $k = 1,\ldots,n$, such that the weight of any hyperplane under
each measure is zero, there exists a hyperplane $H$ that bisects each measure, i.e.,
$\mu_k(H^+) = \mu_k(H^-) = \frac{1}{2}\mu_k(\mathbb{R}^n)$ for each $k$, where $H^+$, $H^-$ denote the two
half-spaces defined by $H$. Now, since we have $n \ge 2$, the Ham-Sandwich Theorem
guarantees that there exists some hyperplane $H$ that bisects (in terms of the
measure $\mu$) both $K_1$ and $K_2$. Let $K'$ and $K''$ be the two parts of $K$ defined by
$H$ ($K$ and $B$ are not necessarily bisected), and similarly define $K_1'$, $K_1''$, $K_2'$, $K_2''$,
$B'$ and $B''$. The minimum distance cannot decrease, i.e., $d(K_1',K_2') \ge t$, and
$d(K_1'',K_2'') \ge t$, and the diameter of $K$ cannot be smaller than either the diameter
of $K'$ or $K''$. Consequently, if the theorem holds, or fails by less than $\delta$, for both
$K'$ and $K''$, then

$$\begin{aligned}
(1+\delta)\mu(B) &= (1+\delta)\mu(B') + (1+\delta)\mu(B'') \\
&\ge e^{-\beta}\,\frac{t}{\mathrm{diam}(K)}\left(\min\left\{\tfrac{1}{2}\mu(K_1),\tfrac{1}{2}\mu(K_2)\right\} + \min\left\{\tfrac{1}{2}\mu(K_1),\tfrac{1}{2}\mu(K_2)\right\}\right) .
\end{aligned}$$

Therefore the theorem must fail by $\delta$ for either $K'$ or $K''$. We note that this
is the same $\delta$ as above. Call the set for which the theorem does not hold $K^{(1)}$,
and similarly define $K_1^{(1)}$, $K_2^{(1)}$ and $B^{(1)}$. We continue bisecting in this way,
always focusing on the side for which the theorem fails by $\delta$, thus obtaining a
sequence of nested sets $K \supseteq K^{(1)} \supseteq \cdots \supseteq K^{(j)} \supseteq \cdots$.

We claim that eventually the smallest axis of the Löwner-John ellipsoid will
be smaller than $\epsilon$. If this is not the case, then the set $K^{(j)}$ always contains a ball
of radius $\epsilon/n$. This follows from the properties of the Löwner-John ellipsoid.
Therefore, letting $B_{\epsilon/n}(x_0)$ denote the ball of radius $\epsilon/n$ centered at $x_0$, we have

$$\mu(K^{(j)}) = \int_{K^{(j)}} f(x)\,dx \ge \inf_{x_0 : B_{\epsilon/n}(x_0) \subseteq K} \int_{B_{\epsilon/n}(x_0)} f(x)\,dx \ge \eta > 0 ,$$

for some $\eta > 0$, independent of $j$. We know that $\eta > 0$ by our initial assumption
that $f(x)$ is non-zero on $K$.

However, by our choice of hyperplanes, the sets $K_1^{(j)}$, $K_2^{(j)}$ are bisected with
respect to the measure $\mu$. Thus $\mu(K_1^{(j)}) = 2^{-j}\mu(K_1)$, and $\mu(K_2^{(j)}) = 2^{-j}\mu(K_2)$,
and the measure of each set $K_1^{(j)}$, $K_2^{(j)}$ becomes arbitrarily small as $j$ increases.
Since the measure of $K^{(j)}$ does not also become arbitrarily small, the measure
of $B^{(j)}$ must also be bounded away from zero. In particular, $\mu(B^{(j)}) \ge \eta -
2^{-j}(\mu(K_1) + \mu(K_2))$, and thus for $j \ge \log_2(2(\mu(K_1) + \mu(K_2))/\eta)$, $\mu(B^{(j)}) \ge
\eta/2 \ge \min\{\mu(K_1^{(j)}),\mu(K_2^{(j)})\}$. This contradicts our assumption that the theorem
fails on all elements of our nested chain of sets. The contradiction completes the
proof of the lemma. □

Proof of Theorem 2: The proof is by induction on the number of dimensions.
By Lemma 4 above, the statement holds for $n = 1$. Assume that the result
holds for $n$ dimensions. Suppose we have $K \subseteq \mathbb{R}^{n+1}$, with the decomposition
$K = K_1 \cup B \cup K_2$, satisfying the assumptions of the theorem. We show that for
every $\delta > 0$:

$$(1+\delta)\mu(B) \ge e^{-\beta}\,\frac{t}{\mathrm{diam}(K)}\,\min\{\mu(K_1),\mu(K_2)\} .$$

Taking $\delta$ to zero yields our result. Let $E$ be the Löwner-John ellipsoid of $K$.
By Lemma 5 above, we can assume that the Löwner-John ellipsoid of $K$ has
at least one axis of length no more than $\epsilon$. Figure 1 illustrates the bisecting
process of Lemma 5, and also the essential reason why the bisection allows us
to project to one fewer dimensions. We take $\epsilon$ smaller than $t/2$, and also such
that $\sqrt{t^2 - 4\epsilon^2} \ge t/(1+\delta)$.

Fig. 1. The inductive step works by projecting $K$ onto one less dimension. In (a),
a projection on the horizontal axis would yield a distance of zero between the projected
$K_1$ and $K_2$. Once we bisect to obtain (b), we see that a projection onto the horizontal
axis would not affect the minimum distance between $K_1$ and $K_2$.

Assume that the $(n+1)$st coordinate direction is
parallel to the shortest axis of the ellipsoid, and the first $n$ coordinate directions
span the same plane as the other $n$ axes of the ellipse (changing coordinates
if necessary). Call the last coordinate $y$, so that we refer to points in $\mathbb{R}^{n+1}$ as
$(x,y)$, for $x \in \mathbb{R}^n$, and $y \in \mathbb{R}$. Let $\Pi$ denote the plane spanned by the other $n$
axes, and let $K_\Pi = \pi(K)$ denote the projection of $K$ onto $\Pi$. Since $\epsilon < t/2$,
no point in $K_\Pi$ is the image of points in both $K_1$ and $K_2$, otherwise the two
pre-images would be at most $2\epsilon < t$ apart. This allows us to define the sets

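$$\begin{aligned}
\hat{K}_1 &= \{(x,y) \in K : \pi(x,y) \in \pi(K_1)\}, \\
\hat{K}_2 &= \{(x,y) \in K : \pi(x,y) \in \pi(K_2)\}, \\
\hat{B} &= \{(x,y) \in K : \pi(x,y) \notin \pi(K_1) \cup \pi(K_2)\}.
\end{aligned}$$

Note that $\mu(\hat{K}_i) \ge \mu(K_i)$, $i = 1,2$, and $\mu(\hat{B}) \le \mu(B)$. Again we have
a decomposition $K = \hat{K}_1 \cup \hat{B} \cup \hat{K}_2$. On $K_\Pi$, we also have a decomposition:
$K_\Pi = \pi(\hat{K}_1) \cup \pi(\hat{B}) \cup \pi(\hat{K}_2)$. Since we project with respect to the $L_2$
norm, by the Pythagorean Theorem, $d(\pi(\hat{K}_1),\pi(\hat{K}_2)) \ge \sqrt{t^2 - 4\epsilon^2}$. In addition,
$\mathrm{diam}(K_\Pi) \le \mathrm{diam}(K)$.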
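The two requirements on the flatness parameter in this step, $\epsilon < t/2$ and $\sqrt{t^2 - 4\epsilon^2} \ge t/(1+\delta)$, can be solved for $\epsilon$ explicitly: squaring the second condition gives $\epsilon \le \frac{t}{2}\sqrt{1 - (1+\delta)^{-2}}$, which is automatically below $t/2$. A small sketch of this computation follows; the function name is illustrative.

```python
import math


def max_flatness(t, delta):
    """Largest epsilon with sqrt(t^2 - 4 eps^2) >= t / (1 + delta), for t > 0, delta > 0."""
    return 0.5 * t * math.sqrt(1.0 - 1.0 / (1.0 + delta) ** 2)


def projected_distance_bound(t, eps):
    """Lower bound sqrt(t^2 - 4 eps^2) on the distance between the projected sets."""
    return math.sqrt(t * t - 4.0 * eps * eps)
```

As $\delta$ tends to zero the admissible $\epsilon$ also tends to zero, which is why the proof needs sets that are arbitrarily flat.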

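For $x \in K_\Pi$, define the section $K(x) = \{(x,y) \in \mathbb{R}^{n+1} : (x,y) \in K\}$. We
define a function on $K_\Pi \subseteq \mathbb{R}^n$: $F(x) = \int_{K(x)} f(x,y)\,dy$, where $f(x,y)$ is our
$\beta$-log-concave function on $\mathbb{R}^{n+1}$. We have

$$\int_{\pi(\hat{K}_i)} F(x)\,dx = \int_{\hat{K}_i} f(x,y)\,dx\,dy = \mu(\hat{K}_i), \quad i = 1,2,$$

and similarly for $\hat{B}$. By Lemma 3, $F(x)$ is $\beta$-log-concave. Therefore, by the
inductive hypothesis, we have that

$$\begin{aligned}
\mu(\hat{B}) = \int_{\hat{B}} f(x,y)\,dx\,dy &= \int_{\pi(\hat{B})} F(x)\,dx \\
&\ge e^{-\beta}\,\frac{\sqrt{t^2 - 4\epsilon^2}}{\mathrm{diam}(K_\Pi)}\,\min\left\{\int_{\pi(\hat{K}_1)} F(x)\,dx, \int_{\pi(\hat{K}_2)} F(x)\,dx\right\} \\
&\ge e^{-\beta}\,\frac{\sqrt{t^2 - 4\epsilon^2}}{\mathrm{diam}(K)}\,\min\{\mu(\hat{K}_1),\mu(\hat{K}_2)\} \\
&\ge e^{-\beta}\,\frac{\sqrt{t^2 - 4\epsilon^2}}{\mathrm{diam}(K)}\,\min\{\mu(K_1),\mu(K_2)\} ,
\end{aligned}$$

and thus $(1+\delta)\mu(B) \ge e^{-\beta}(t/\mathrm{diam}(K))\min\{\mu(K_1),\mu(K_2)\}$. Since this holds for
every $\delta > 0$, the result follows. □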

Corollaries 1 and 2 below offer some flexibility for obtaining a tighter lower 
bound on y.{B). 

Corollary 1. Let $K$ be a closed and bounded convex set with a decomposition
$K = K_1 \cup B \cup K_2$ as in Theorem 2 above. Let $f(x)$ be any distribution that
is bounded away from zero on $K$, say $f(x) \ge \eta$ for $x \in K$. Then the induced
measure $\mu$ satisfies

$$\mu(B) \ge \eta\, \frac{d(K_1,K_2)}{\mathrm{diam}(K)} \min\{\nu(K_1), \nu(K_2)\},$$

where $\nu$ denotes Lebesgue measure.



Proof. Consider the uniform distribution on $K$. Since it is log-concave, Theorem
2 applies with $\beta = 0$. Since the Lebesgue measure $\nu$ is just a scaled uniform
distribution, $\nu(B) \ge (d(K_1,K_2)/\mathrm{diam}(K)) \min\{\nu(K_1), \nu(K_2)\}$. The corollary
follows since $\mu(B) \ge \eta\,\nu(B)$. □
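
As a rough numerical illustration of the kind of inequality used above, the following sketch checks a one-dimensional instance. It assumes that Theorem 2 has the form $\mu(B) \ge e^{-\beta}(d(K_1,K_2)/\mathrm{diam}(K)) \min\{\mu(K_1), \mu(K_2)\}$, as its uses in this section suggest, and takes $f$ to be a standard normal density truncated to $K = [-L, L]$ (log-concave, so $\beta = 0$). The function names and the choice of density are illustrative assumptions, not part of the paper.

```python
import math

def normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def truncated_normal_mass(a: float, b: float, half_width: float) -> float:
    """Mass of [a, b] under a standard normal truncated to [-half_width, half_width]."""
    total = normal_cdf(half_width) - normal_cdf(-half_width)
    return (normal_cdf(b) - normal_cdf(a)) / total

def check_isoperimetric(a: float, b: float, half_width: float, beta: float = 0.0):
    """Return (mu(B), lower bound) for K1 = [-L, a], B = (a, b), K2 = [b, L]."""
    mu_k1 = truncated_normal_mass(-half_width, a, half_width)
    mu_b = truncated_normal_mass(a, b, half_width)
    mu_k2 = truncated_normal_mass(b, half_width, half_width)
    t = b - a                   # d(K1, K2) in one dimension
    diam = 2.0 * half_width     # diam(K)
    bound = math.exp(-beta) * (t / diam) * min(mu_k1, mu_k2)
    return mu_b, bound

if __name__ == "__main__":
    for a, b in [(-0.5, 0.5), (2.0, 2.5), (-2.9, -1.0)]:
        mass, bound = check_isoperimetric(a, b, half_width=3.0)
        print(f"B=({a}, {b}): mu(B)={mass:.4f} >= bound={bound:.4f}")
```

In each case the mass of the separating strip dominates the bound, usually by a wide margin; the bound is only expected to be tight for nearly uniform mass.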

Corollary 2. Fix $\epsilon > 0$. Let $K$ be a closed, convex, but not necessarily bounded
set. Let $K = K_1 \cup B \cup K_2$ be a decomposition of $K$. Let $f$ be a $\beta$-log-concave
distribution with induced measure $\mu$, such that there exists $d(\epsilon)$ for which
$(1-\epsilon)\mu(K_1) \le \mu(K_1 \cap B_{d(\epsilon)})$, $(1-\epsilon)\mu(K_2) \le \mu(K_2 \cap B_{d(\epsilon)})$, and
$(1-\epsilon)\mu(B) \le \mu(B \cap B_{d(\epsilon)})$, where $B_{d(\epsilon)}$ is a ball with radius $d(\epsilon)$ around the origin. Then

$$\mu(B) \ge e^{-\beta} (1-\epsilon)^2\, \frac{d(K_1,K_2)}{d(\epsilon)} \min\{\mu(K_1), \mu(K_2)\}.$$

Proof. We have that $\mu(K \cap B_{d(\epsilon)}) \ge (1-\epsilon)\mu(K)$. Let $P = \mu(K \cap B_{d(\epsilon)})$, and note
that $P \ge 1-\epsilon$. Consider the measure $\hat\mu$ defined on $K \cap B_{d(\epsilon)}$ by the distribution
$\hat f(x) = f(x)/P$. It follows that $\hat f$ is $\beta$-log-concave. We now apply Theorem 2 on
$\hat f$ to obtain that $\hat\mu(B \cap B_{d(\epsilon)}) \ge e^{-\beta}(t/d(\epsilon)) \min\{\hat\mu(K_1 \cap B_{d(\epsilon)}), \hat\mu(K_2 \cap B_{d(\epsilon)})\}$,
where $t \ge d(K_1,K_2)$. It follows that $\hat\mu(K_1 \cap B_{d(\epsilon)}) \ge (1-\epsilon)\mu(K_1)$, and similarly
for $K_2$, and $\mu(B)/(1-\epsilon) \ge \mu(B)/P \ge \hat\mu(B \cap B_{d(\epsilon)})$. The result follows by some
algebra. □

4 Lower Bounds for Classification and the Size of the Margin



Lower bounds on the generalization error in classification require a careful definition
of the probabilistic setup. In this section we consider a generic setup where
proper learning is possible. We first consider the standard classification problem
where data points $x \in \mathbb{R}^n$ and labels $y \in \{-1, 1\}$ are given, and not necessarily
generated according to any particular distribution. We assume that we are given
a set of classifiers $\mathcal{H}$ which are functions from $\mathbb{R}^n$ to $\{-1,1\}$. Suppose that the
performance of the classifier is measured using some $\beta$-log-concave distribution
$f$ (and associated measure $\mu$). We note that this model deviates from the "classical"
statistical machine learning setup. Given a distribution $f$, the disagreement
of a classifier $h \in \mathcal{H}$ with another classifier $h'$ is defined as:

$$\Delta(h; h') = \int \tfrac{1}{2}\,(1 - h(x)h'(x))\, f(x)\,dx = \mu\{x \in \mathbb{R}^n : h(x) \ne h'(x)\},$$

where $\mu$ is the probability measure induced by $f$. If there exists a true classifier
$h^{\mathrm{true}}$ (not necessarily in $\mathcal{H}$) such that $y = h^{\mathrm{true}}(x)$ then the error of $h$ is
$\Delta(h; h^{\mathrm{true}})$. For a classifier $h$, let $K^+(h) = \{x \in K : h(x) = 1\}$, and similarly
$K^-(h) = \{x \in K : h(x) = -1\}$. Given a pair of classifiers $h_1$ and $h_2$ we define the
distance between them as

$$\mathrm{dist}(h_1, h_2) = \max\{ d(K^+(h_1), K^-(h_2)),\ d(K^-(h_1), K^+(h_2)) \}.$$

We note that $\mathrm{dist}(h_1, h_2)$ may equal zero even if the classifiers are rather different.
However, in some cases, $\mathrm{dist}(h_1, h_2)$ provides a useful measure of difference; see
Proposition 1 below.

Suppose we have to choose a classifier from a set $\mathcal{H}$. This may occur if,
for example, we are given sample data points and there are several classifiers 
that classify the data correctly. The following theorem states that if the set of 
classifiers we choose from is too large, then the error might be large as well. 
Note that we have to scale the error lower bound by the minimal weight of the 
positively/negatively labelled region. 



Theorem 3. Suppose that $f$ is $\beta$-log-concave defined on a bounded set $K$. Then
for every $h \in \mathcal{H}$ there exists $h' \in \mathcal{H}$ such that

$$\Delta(h; h') \ge \frac{e^{-\beta} P_0}{2\,\mathrm{diam}(K)} \sup_{h_1,h_2 \in \mathcal{H}} \mathrm{dist}(h_1, h_2),$$

where $P_0 = \inf_{h \in \mathcal{H}} \min\{\mu(K^+(h)), \mu(K^-(h))\}$.

Proof. If $\sup_{h_1,h_2 \in \mathcal{H}} \mathrm{dist}(h_1,h_2) = 0$, the result follows, so we can assume this
is not the case. For every $\epsilon > 0$ we can choose $h_1 \in \mathcal{H}$ and $h_2 \in \mathcal{H}$ such
that $\mathrm{dist}(h_1,h_2) \ge \sup_{h_1',h_2' \in \mathcal{H}} \mathrm{dist}(h_1',h_2') - \epsilon$. We consider the case where
$\mathrm{dist}(h_1,h_2) = d(K^+(h_1),K^-(h_2))$; the other case, where
$d(K^-(h_1),K^+(h_2)) = \mathrm{dist}(h_1,h_2)$, follows in a symmetric manner.
Let $B = K \setminus (K^+(h_1) \cup K^-(h_2))$. It follows by Theorem 2 that

$$\mu(B) \ge e^{-\beta}\, \frac{\mathrm{dist}(h_1,h_2)}{\mathrm{diam}(K)} \min\{\mu(K^+(h_1)), \mu(K^-(h_2))\}. \qquad (4.4)$$

Now, $\Delta(h; h_1) \ge \int_B \chi_{\{h(x) \ne h_1(x)\}} f(x)\,dx$ and $\Delta(h; h_2) \ge \int_B \chi_{\{h(x) \ne h_2(x)\}} f(x)\,dx$.
Since $h_1(x) \ne h_2(x)$ on $B$, then either $\Delta(h; h_1) \ge \mu(B)/2$ or $\Delta(h; h_2) \ge \mu(B)/2$.
Since $P_0 \le \mu(K^+(h_1))$ and $P_0 \le \mu(K^-(h_2))$, and by substituting in Eq. (4.4)
we obtain that $\Delta(h; h_i) \ge e^{-\beta}\,\mathrm{dist}(h_1,h_2)\,P_0/(2\,\mathrm{diam}(K))$ for $i = 1$ or $i = 2$.
The result follows by taking $\epsilon$ to 0. □

The following example demonstrates the power of Theorem 3 in the context of
linear classification. Consider an input-output sequence $\{(x_1,y_1), \ldots, (x_m,y_m)\}$
arising from some unknown source (not necessarily $\beta$-log-concave) as in the
classical binary classification problem. Define $X^+ = \{x_i : y_i = 1\}$ and
$X^- = \{x_i : y_i = -1\}$. Suppose that the true error is measured according to
a $\beta$-log-concave distribution, and that $X^+$ and $X^-$ are linearly separable. Recall
that a linear classifier $h$ is a function given by $h(x) = \mathrm{sign}(\langle x, u\rangle + b)$, where
'sign' is the sign function and $\langle\cdot,\cdot\rangle$ is the standard inner product in $\mathbb{R}^n$. The
following proposition provides a lower bound on the true error. We state it for
generic sets of vectors, so the data are not assumed to be sampled from any
concrete source. The lower bound concerns the case where we are faced with a
choice from a set of classifiers, all of which agree with the data (i.e., zero training
error). If we commit to any specific classifier, then there exists another classifier
(whose training error is zero as well) such that the true error of the classifier we
committed to is relatively large if the other classifier happens to equal $h^{\mathrm{true}}$.

Proposition 1. Suppose that we are given two sets of linearly separable vectors
$X^+$ and $X^-$ and let $t = d(\mathrm{conv}(X^+), \mathrm{conv}(X^-))$. Then for every linear
classifier $h$ that separates $X^+$ and $X^-$, and any $\beta$-log-concave distribution $f$
and induced measure $\mu$ defined on a bounded set $K$, there exists another linear
classifier $h'$ that separates $X^+$ and $X^-$ as well, such that
$\Delta(h; h') \ge e^{-\beta} P_0\, t/(2\,\mathrm{diam}(K))$, where
$P_0 = \min\{\mu(\{x : \langle x,u\rangle \ge \langle x^+,u\rangle\}),\ \mu(\{x : \langle x,u\rangle \le \langle x^-,u\rangle\})\}$
for some $x^{\pm} \in \mathrm{conv}(X^{\pm})$ such that $d(x^+,x^-) = t$ and
$u = (x^+ - x^-)/2$.

Proof. Let $\mathcal{H}$ be the set of all hyperplanes that separate $X^+$ from $X^-$.
It follows by a standard linear programming argument (see [BB00]) that
$\sup_{h_1,h_2 \in \mathcal{H}} \mathrm{dist}(h_1,h_2) = t$. This is attained for $h_1(x) = \mathrm{sign}(\langle x,u\rangle - \langle x^+,u\rangle)$
and $h_2(x) = \mathrm{sign}(\langle x,u\rangle - \langle x^-,u\rangle)$. We now apply Theorem 3 to obtain the desired
result. Note that $P_0$ in the statement of the proposition is tighter than
$P_0$ in Theorem 3. This is the result of calculating $\mu(K^+(h_1))$ and $\mu(K^-(h_2))$
directly (instead of taking the infimum as in Theorem 3). □

We now consider the standard machine learning setup, and assume that the
data are sampled from a $\beta$-log-concave distribution. We examine the geometric
margin as opposed to the "functional" margin which is often defined with respect
to a real valued function $g$. In that case classification is performed by considering
$h(x) = \mathrm{sign}(g(x))$ and the margin of $g$ at $(x, y) \in \mathbb{R}^n \times \{-1, 1\}$ is defined as $g(x)y$.
If such a function $g$ is Lipschitz with a constant $L$, then for $x \in K^+(h)$ the event
that $\{d(x,K^-(h)) < \gamma\}$ is contained in the event that $\{g(x) < \gamma L\}$ (and for
$x \in K^-(h)$, if $d(x,K^+(h)) < \gamma$ then $-g(x) < \gamma L$). Consequently, results on the
geometric margin can be easily converted to results on the "functional" margin
as long as the Lipschitz assumption holds.

Suppose now that we have a classifier $h$, and we ask the following question:
what is the probability that if we sample $N$ vectors $X^N = x_1, \ldots, x_N$ from $f$, they
are far away from the boundary between $K^+(h)$ and $K^-(h)$? More precisely, we
want to bound the probability of the event $\{\min_{i : x_i \in K^+(h)} d(x_i, K^-(h)) > \gamma\}$,
and similarly for negatively labelled samples. We next show that the probability
that a sampled point falls within a given distance of the boundary is bounded
from below by a quantity that is almost linear in this distance. An immediate
consequence is an exponential concentration inequality.

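Proposition 2. Suppose we are given a classifier $h$ defined on a bounded set
$K$. Fix some $\gamma > 0$ and consider the set $B = \{x \in K^-(h) : d(x,K^+(h)) < \gamma\}$.
Let $f$ be a $\beta$-log-concave distribution on $K$ with induced measure $\mu$. Then

$$\mu(B) \ge \frac{\gamma e^{-\beta}}{\mathrm{diam}(K)} \min\left\{ \mu(K^+(h)),\ \frac{\mu(K^-(h))}{1 + \gamma e^{-\beta}/\mathrm{diam}(K)} \right\}.$$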

Proof. Consider the decomposition of $K$ to $K_1 = K^+(h)$, $B$, and $K_2 = K^-(h) \setminus B$.
By Theorem 2 we know that $\mu(B) \ge \gamma e^{-\beta} \min\{\mu(K_1), \mu(K_2)\}/\mathrm{diam}(K)$. We
also know that $\mu(B) = \mu(K^-(h)) - \mu(K_2)$. So that

$$\mu(B) \ge \max\{\gamma e^{-\beta} \min\{\mu(K_1), s\}/\mathrm{diam}(K),\ \mu(K^-(h)) - s\}, \qquad (4.5)$$

where $s = \mu(K_2)$. Minimizing over $s$ in the interval $[0, \mu(K^-(h))]$, it is
seen that the minimizer $s$ is either at the point where $\mu(K^-(h)) - s = \gamma e^{-\beta} \mu(K_1)/\mathrm{diam}(K)$
or at the point where $\mu(K^-(h)) - s = s\,\gamma e^{-\beta}/\mathrm{diam}(K)$.
Substituting those $s$ in Eq. (4.5) and some algebra gives the desired result. □

A similar result holds by interchanging $K^+$ and $K^-$ throughout Proposition
2. The following corollary is an immediate application of the above.

Corollary 3. Suppose that $N$ samples $X^N = \{x_1, \ldots, x_N\}$ are drawn independently
from a $\beta$-log-concave distribution $f$ defined on a bounded set $K$. Let $h$ be
a classifier. Then for every $\gamma > 0$:

$$\Pr\left( \min_{\{i : x_i \in K^-(h)\}} d(x_i, K^+(h)) > \gamma \right) \le \exp\left( -N\gamma C \min\left\{ \mu(K^+(h)),\ \frac{\mu(K^-(h))}{1 + \gamma C} \right\} \right),$$

where Pr is the probability measure of drawing $N$ samples from $f$ and
$C = e^{-\beta}/\mathrm{diam}(K)$.

Proof. The proof follows from Proposition 2 and the inequality $(1 - a)^N \le
\exp(-aN)$ for $a \in [0, 1]$ and $N \ge 0$. □
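
To make the shape of this bound concrete, the sketch below evaluates the right-hand side of Corollary 3 (in the form $\exp(-N\gamma C \min\{\mu(K^+(h)), \mu(K^-(h))/(1+\gamma C)\})$ with $C = e^{-\beta}/\mathrm{diam}(K)$) and compares it with an exactly computable toy case. The toy case is an assumption made for illustration only: the uniform density on $K = [0,1]$ (so $\beta = 0$ and $\mathrm{diam}(K) = 1$) with the threshold classifier $h(x) = +1$ iff $x \ge 0.5$. Function names are hypothetical.

```python
import math

def corollary3_bound(n_samples, gamma, mu_plus, mu_minus, beta, diam):
    """Right-hand side of the concentration bound, with C = exp(-beta) / diam."""
    c = math.exp(-beta) / diam
    rate = gamma * c * min(mu_plus, mu_minus / (1.0 + gamma * c))
    return math.exp(-n_samples * rate)

def exact_uniform_probability(n_samples, gamma):
    """Exact probability of the event for the uniform density on [0, 1] and
    h(x) = +1 iff x >= 0.5.

    A negatively labelled sample x has d(x, K+) = 0.5 - x, so the event
    'every negative sample is farther than gamma from K+' holds iff no sample
    lands in the strip [0.5 - gamma, 0.5), which has mass min(gamma, 0.5).
    """
    strip_mass = min(gamma, 0.5)
    return (1.0 - strip_mass) ** n_samples

if __name__ == "__main__":
    for n, gamma in [(10, 0.05), (50, 0.1), (200, 0.02)]:
        exact = exact_uniform_probability(n, gamma)
        bound = corollary3_bound(n, gamma, 0.5, 0.5, beta=0.0, diam=1.0)
        print(f"N={n}, gamma={gamma}: exact={exact:.4g} <= bound={bound:.4g}")
```

Both quantities decay exponentially in $N\gamma$; the bound is looser by a constant in the exponent, which is the price of holding for every $\beta$-log-concave density.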

Corollary 3 is a dimension-free inequality. It implies that when sampling
from a $\beta$-log-concave distribution, for any specific classifier, we cannot hope to
have a large margin. It does not claim, however, that the empirical margin is
small. Specifically, for $X^N = \{x_1, \ldots, x_N\}$ one can consider the probabilistic
behavior of the following empirical gap between the classes: $\mathrm{gap}(X^N; h) =$

The probability that this quantity is larger than $\gamma$
cannot be bounded in a dimension-free manner. The reason is that as the number
of dimensions grows to infinity the distance between the samples may become
bounded away from zero. To see that, consider uniformly distributed samples
on the unit ball in $\mathbb{R}^n$. If $n$ is much bigger than $N$ it is not hard to prove that
all the sampled vectors will be (with high probability) equally far apart from
each other. So $\mathrm{gap}(X^N; h)$ does not converge to 0 (for every non-trivial $h$) in the
regime where $n$ increases fast enough with $N$. For every fixed $n$ one can bound
the probability that $\mathrm{gap}(X^N; h)$ is large using covering number arguments, as
in [SC99], but such a bound must be dimension-dependent.

We finally note that a uniform bound in the spirit of Corollary 3 is of interest.
Specifically, let the empirical margin of a classifier $h$ on sample points $X^N$ be
denoted by:

$$\mathrm{margin}(X^N; h) = \min\{ d((X^N \cap K^-(h)), K^+(h)),\ d((X^N \cap K^+(h)), K^-(h)) \}.$$

It is of interest to bound $\Pr(\sup_{h \in \mathcal{H}} \mathrm{margin}(X^N; h) > \gamma)$. We leave the issue of
efficiently bounding the empirical margin to future research.



5 Regression Tubes 

Consider a function $k$ from $\mathbb{R}^n$ to $\mathbb{R}^m$. In this section we provide a result of a
different flavor that concerns the weight of tubes around $k$. The probabilistic
setup is as follows. We have a probability measure $f$ on $\mathbb{R}^{n+m}$ that prescribes
the probability of getting a pair $(x,y) \in \mathbb{R}^n \times \mathbb{R}^m$. For a function $k : \mathbb{R}^n \to \mathbb{R}^m$
we consider the set

$$T_{\epsilon_0,\epsilon_1}(k) = \{(x,y) : \epsilon_0 < \|k(x) - y\| < \epsilon_1\}.$$

This set represents all the pairs where the prediction of $k$ is off by more than $\epsilon_0$
and less than $\epsilon_1$, or alternatively, the set of pairs whose prediction is converted
to zero error when changing the $\epsilon$ in an $\epsilon$-insensitive error criterion from $\epsilon_0$ to
$\epsilon_1$.

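Corollary 4. Suppose that $f$ is $\beta$-log-concave on a bounded set $K \subset \mathbb{R}^{n+m}$,
with induced measure $\mu$. Assume that $k$ is Lipschitz continuous with constant $L$.
Then for every $\epsilon_1 > \epsilon_0 > 0$

$$\mu(T_{\epsilon_0,\epsilon_1}(k)) \ge e^{-\beta}\, \frac{\epsilon_1 - \epsilon_0}{L\,\mathrm{diam}(K)} \min\{\mu(T_{0,\epsilon_0}(k)),\ \mu(T_{\epsilon_1,\mathrm{diam}(K)}(k))\}.$$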

Proof. We use Theorem 2 with the decomposition $K_1 = T_{0,\epsilon_0}(k)$, $B = T_{\epsilon_0,\epsilon_1}(k)$,
and $K_2 = T_{\epsilon_1,\mathrm{diam}(K)}(k)$. Note that $d(T_{0,\epsilon_0}, T_{\epsilon_1,\mathrm{diam}(K)}) \ge (\epsilon_1 - \epsilon_0)/L$, since $k$ is
Lipschitz with constant $L$. □

A result where $f$ is conditionally $\beta$-log-concave (i.e., given that $x$ was sampled,
the conditional probability of $y$ is $\beta$-log-concave) is desirable. This requires
some additional continuity assumptions on $f$, and is left for future research.

Abstract. The Bayes classifier achieves the minimal error rate by constructing
a weighted majority over all concepts in the concept class. The
Bayes Point [1] uses the single concept in the class which has the minimal
error. This way, the Bayes Point avoids some of the deficiencies of
the Bayes classifier. We prove a bound on the generalization error for
Bayes Point Machines when learning linear classifiers, and show that it
is at most ≈ 1.71 times the generalization error of the Bayes classifier,
independent of the input dimension and length of training. We show
that when learning linear classifiers, the Bayes Point is almost identical
to the Tukey Median [2] and Center Point [3]. We extend these definitions
beyond linear classifiers and define the Bayes Depth of a classifier.
We prove a generalization bound in terms of this new definition. Finally
we provide a new concentration of measure inequality for multivariate
random variables to the Tukey Median.



1 Introduction 

In this paper we deal with supervised concept learning in a Bayesian framework.
The task is to learn a concept $c$ from a concept class $C$. We assume that the
target $c$ is randomly chosen from $C$ according to a known probability distribution
$\nu$. The Bayes classifier is known to be optimal in this setting, i.e. it achieves
the minimal possible expected loss. However the Bayes classifier suffers from
two major deficiencies. First, it is usually computationally infeasible, since each
prediction requires voting over all parameters. The second problem is the possible
inconsistency of the Bayes classifier [4], as it is often outside of the target class.
Consider for example the following scenario: Alice, Bob and Eve would like to
vote on the linear order of three items A, B and C. Alice suggests A < B < C,
Bob suggests C < A < B, and Eve suggests B < C < A. Voting among the three,
as the Bayes classifier does, will lead to A < B, B < C and C < A which does
not form a linear order.
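
The voting scenario can be replayed in a few lines. The sketch below is only an illustration of the cycle described above; the helper names `pairwise_majority` and `is_linear_order` are hypothetical, and an odd number of voters is assumed so that no pairwise vote is tied.

```python
from itertools import combinations

def pairwise_majority(rankings):
    """Return the set of pairs (a, b) meaning 'a majority ranks a below b'.

    Each ranking lists the items from smallest to largest. An odd number of
    voters is assumed, so no pairwise comparison can end in a tie.
    """
    items = sorted(rankings[0])
    relation = set()
    for a, b in combinations(items, 2):
        votes = sum(r.index(a) < r.index(b) for r in rankings)
        relation.add((a, b) if 2 * votes > len(rankings) else (b, a))
    return relation

def is_linear_order(relation, items):
    """A complete pairwise relation is a linear order iff the items can be
    arranged so that they are below exactly 0, 1, ..., n-1 other items."""
    below_count = {x: 0 for x in items}
    for a, _ in relation:
        below_count[a] += 1
    return sorted(below_count.values()) == list(range(len(items)))

votes = [["A", "B", "C"],   # Alice: A < B < C
         ["C", "A", "B"],   # Bob:   C < A < B
         ["B", "C", "A"]]   # Eve:   B < C < A
majority = pairwise_majority(votes)

if __name__ == "__main__":
    print(sorted(majority))                  # the three majority decisions
    print(is_linear_order(majority, "ABC"))  # False: the decisions form a cycle
```

Each individual voter holds a valid linear order, yet the aggregated pairwise decisions do not, which is the sense in which the majority vote falls outside the concept class.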

The computational infeasibility and possible inconsistency of the Bayes op- 
timal classifier are both due to the fact that it is not a single classifier from the 
given concept class but rather a weighted majority among concepts in the class. 
These drawbacks can be resolved if one selects a single classifier in the proper
class (or a proper ordering in the previous example). Indeed, once the single
concept is selected, its predictions are usually both efficient and consistent. It is, 
however, no longer Bayes optimal. Our problem is to find the single member of 
the concept class which best approximates the optimal Bayes classifier. 

Herbrich, Graepel and Campbell [1] have recently studied this problem. They 
called the single concept which minimizes the expected error the Bayes Point. 
Specifically for the case of linear classifiers, they designed the Bayes Point Ma- 
chine (BPM), which employs the center of gravity of the version space (which 
is convex in this case) as the candidate classifier. This method has been applied 
successfully to various domains, achieving comparable results to those obtained 
by Support Vector Machines [5]. 

1.1 The Results of This Paper 

Theorem 1 provides a generalization bound for Bayes Point Machines. We show
that the expected generalization error of BPM is greater than the expected
generalization error of the Bayes classifier by a factor of at most $(e - 1) \approx 1.71$.
Since the Bayes classifier obtains the minimal expected generalization error we
conclude that BPM is "almost" optimal. Note that this bound is independent
of the input dimension and it holds for any size of the training sequence. These
two factors, i.e. input dimension and training set size, affect the error of BPM
only through the error of the optimal Bayes classifier. The error of Bayes Point
Machines can also be bounded in the online mistake bound model. In theorem 2
we prove that the mistake bound of BPM is at most $\frac{n}{-\log(1-1/e)} \log\frac{2R}{r}$, where
$n$ is the input dimension, $R$ is a bound on the norm of the input data points,
and $r$ is a margin term. This bound is different from Novikoff's well known
mistake bound for the perceptron algorithm [6] of $R^2/r^2$. In our new bound,
the dependency on the ratio $R/r$ is logarithmic, whereas Novikoff's bound is
dimension independent.

The proofs of theorems 1 and 2 follow from a definition of the proximity of
a classifier to the Bayes optimal classifier. In the setting of linear classifiers the
proximity measure is a simple modification of the Tukey Depth [2]. The Tukey
Depth measures the centrality of a point in $\mathbb{R}^n$. For a Borel probability measure
$\nu$ over $\mathbb{R}^n$ the Tukey Depth (or halfspace depth) of $x \in \mathbb{R}^n$ is defined as

$$D(x) = \inf\{\nu(H) \ \text{s.t.}\ H \text{ is a half space and } x \in H\}, \qquad (1)$$

i.e. the depth of $x$ is the minimal probability of a half space which contains
$x$. Using this definition Donoho and Gasko [7] defined the Tukey Median as the
point $x$ which maximizes the depth function $D(x)$ (some authors refer to this
median as the Center Point [3]).
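
For intuition, definition (1) can be approximated for an empirical measure in the plane. The sketch below is a naive illustration under stated assumptions: the measure is the empirical distribution of a finite 2-D sample, only closed half-planes whose boundary passes through the query point are scanned, and only a finite grid of directions is tried, so the returned value is an upper bound on the true empirical depth. The function name is hypothetical.

```python
import math

def empirical_tukey_depth(point, sample, n_directions=360):
    """Approximate the halfspace depth of `point` with respect to the
    empirical measure of `sample` (a list of 2-D points).

    For each direction u on a grid, count the fraction of sample points in
    the closed half-plane {z : (z - point) . u >= 0} and keep the minimum.
    """
    px, py = point
    best = 1.0
    for k in range(n_directions):
        angle = 2.0 * math.pi * k / n_directions
        ux, uy = math.cos(angle), math.sin(angle)
        inside = sum(1 for (x, y) in sample
                     if (x - px) * ux + (y - py) * uy >= 0.0)
        best = min(best, inside / len(sample))
    return best

if __name__ == "__main__":
    corners = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
    print(empirical_tukey_depth((0.0, 0.0), corners))   # a central point
    print(empirical_tukey_depth((10.0, 0.0), corners))  # a point far outside
```

A point outside the convex hull of the sample gets depth zero, while a central point keeps a constant fraction of the sample on every side.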

Donoho and Gasko [7] studied the properties of the Tukey Median. They
showed that the median always exists but need not be unique. They also showed
that for any measure $\nu$ over $\mathbb{R}^n$, the depth of the Tukey Median is at least $1/(n+1)$.
Caplin and Nalebuff [4] proved the Mean Voter Theorem. This theorem (using
different motivations and notations) states that if the measure $\nu$ is log-concave
then the center of gravity of $\nu$ has a depth of at least $1/e$. $\nu$ is log-concave if it
conforms with

$$\nu(\lambda A + (1-\lambda)B) \ge \nu(A)^{\lambda}\, \nu(B)^{1-\lambda}.$$

For example, uniform distributions over convex bodies are log-concave; normal
and chi-square distributions are log-concave as well. See [8] for a discussion and
examples of log-concave measures (a less detailed discussion can be found in
appendix A).

The lower bound of $1/e$ for the depth of the center of gravity for log-concave
measures is the key to our proofs of the bounds for BPM. The intuition behind
the proofs is that any "deep" point must generalize well. This can be extended
beyond linear classifiers to general concept classes. We define the Bayes Depth 
of a hypothesis and show in theorem 3 that the expected generalization error of 
any classifier can be bounded in terms of its Bayes Depth. This bound holds for 
any concept class, including multi-class classifiers. 

Finally we provide a new concentration of measure inequality for multivariate 
random variables to their Tukey Median. This is an extension of the well known 
concentration result of scalar random variables to the median [9] . 

This paper is organized as follows. In section 2 the Bayes Point Machine is 
introduced and the generalization bounds are derived. In section 3 we extend 
the discussion beyond linear classifiers. We define the Bayes Depth and prove 
generalization bounds for the general concept class setting. A concentration of 
measure inequality for multivariate random variables to their Tukey Median is 
provided in section 4. Further discussion of the results is provided in section 
5. Some background information regarding concave measures can be found in 
appendix A. The statement of the Mean Voter Theorem is given in appendix B. 

1.2 Preliminaries and Notation 

Throughout this paper we study the problem of concept learning with Bayesian
prior knowledge. The task is to approximate a concept $c \in C$ which was chosen
randomly using a probability measure $\nu$. The Bayes classifier (denoted by $h_{opt}$)
assigns the instance $x$ to the class with minimal expected loss:

$$h_{opt}(x) = \arg\min_{y \in \mathcal{Y}} E_{c \sim \nu}\left[l(y, c(x))\right] \qquad (2)$$

where $l$ is some loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. The Bayes classifier is optimal among
all possible classifiers since it minimizes the expected generalization error:

$$\mathrm{error}(h) = E_x\left[E_{c \sim \nu}\left[l(h(x), c(x))\right]\right] \qquad (3)$$

The Bayes classifier achieves the minimal possible error on each individual instance
$x$ and thus also when averaging over $x$. If a labeled sample is available
the Bayes classifier uses the posterior induced by the sample, and likewise the
expected error is calculated with respect to the same posterior. If the concepts in
$C$ are stochastic then the loss in (2) and (3) should be averaged over the internal
randomness of the concepts.

2 Bayes Point Machine

Herbrich, Graepel and Campbell [1] introduced the Bayes Point Machine as a 
tool for learning classifiers. They defined the Bayes Point as follows: 

Definition 1. Given a concept class $C$, a loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ and a
posterior $\nu$ over $C$, the Bayes Point is:

$$\arg\min_{h \in C} E_x\left[E_{c \sim \nu}\left[l(h(x), c(x))\right]\right]$$

Note that $E_x[E_{c \sim \nu}[l(h(x), c(x))]]$ is the average error of the classifier $h$, as
defined in (3), and thus the Bayes Point, as defined in definition 1, is simply the
classifier in $C$ which minimizes the average error, while the Bayes optimal rule
minimizes the same term without the restriction of choosing $h$ from $C$.

When applying to linear classifiers with the zero-one loss function¹, [1] assumed
a uniform distribution over the class of linear classifiers. Furthermore
they suggested that the center of gravity is a good approximation of the Bayes
Point. In theorem 1 we show that this is indeed the case.

We will consider the case of linear classifiers through the origin. In this case
the sample space is $\mathbb{R}^n$ and a classifier is a half-space through the origin. Formally,
any vector $\theta \in \mathbb{R}^n$ represents a classifier. Given an instance $x \in \mathbb{R}^n$ the
corresponding label is $+1$ if $\theta \cdot x > 0$ and $-1$ otherwise. Note that if $\lambda > 0$ then the
vector $\theta$ and the vector $\lambda\theta$ represent the same classifier; hence we may assume
that $\theta$ is in the unit ball.

Given a sample of labeled instances, the Version Space is defined as the set 
of classifiers consistent with the sample: 

$$\text{Version-Space} = \{\theta : \|\theta\| \le 1 \text{ and } y_i\, \theta \cdot x_i > 0 \text{ for all } 1 \le i \le m\}$$

This version space is the intersection of the unit ball with a set of linear con- 
straints imposed by the observed instances and hence it is convex. The posterior 
is the restriction of the original prior to the version space. Herbrich et al. [1] 
suggested using the center of gravity of the version space as the hypothesis of 
the learning algorithm which they named the Bayes Point Machine. They sug- 
gested a few algorithms which are based on random walks in the version space 
to approximate the center of gravity. 
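
The definition of the version space translates directly into a membership test, and its center of gravity can be estimated by averaging uniform draws that pass the test. The sketch below uses plain rejection sampling from the enclosing cube; this is a deliberately naive stand-in chosen for clarity, not the random-walk algorithms of [1], and it becomes impractical as the dimension grows or the version space shrinks. All names are hypothetical.

```python
import random

def in_version_space(theta, labeled_sample):
    """True iff ||theta|| <= 1 and y * (theta . x) > 0 for every (x, y)."""
    if sum(t * t for t in theta) > 1.0:
        return False
    return all(y * sum(t * xi for t, xi in zip(theta, x)) > 0.0
               for x, y in labeled_sample)

def approximate_center_of_gravity(labeled_sample, dim, n_draws=20000, seed=0):
    """Average of uniform draws from the cube [-1, 1]^dim that land in the
    version space (uniform prior assumed)."""
    rng = random.Random(seed)
    total = [0.0] * dim
    accepted = 0
    for _ in range(n_draws):
        theta = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if in_version_space(theta, labeled_sample):
            accepted += 1
            for j in range(dim):
                total[j] += theta[j]
    if accepted == 0:
        raise ValueError("no draw landed in the version space")
    return [s / accepted for s in total]

if __name__ == "__main__":
    # Two positively labelled instances: the version space is a quarter disk.
    sample = [((1.0, 0.0), +1), ((0.0, 1.0), +1)]
    print(approximate_center_of_gravity(sample, dim=2))
```

Because the version space is convex, the average of accepted draws is itself consistent with the sample, so the returned vector is always a valid hypothesis.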



2.1 Generalization Bounds for Bayes Point Machines 

Our main result is a generalization bound for the Bayes Point Machine learning 
algorithm. 

¹ The zero-one loss function is zero whenever the predicted class and the true class
are the same. Otherwise, the loss is one.




Bayes and Tukey Meet at the Center Point 553 



Theorem 1. Let $\nu$ be a continuous log-concave measure² over the unit ball in
$\mathbb{R}^n$ (the prior) and assume that the target concept is chosen according to $\nu$. Let
BPM be a learning algorithm such that after seeing a batch of labeled instances $S$
returns the center of gravity of $\nu$ restricted to the version space as a hypothesis
$h_{bpm}$. Let $h_{opt}(\cdot)$ be the Bayes optimal classifier. For any $x \in \mathbb{R}^n$ and any
sample $S$

$$\Pr_c\left[h_{bpm}(x) \ne c(x) \mid S\right] \le (e - 1)\, \Pr_c\left[h_{opt}(x) \ne c(x) \mid S\right].$$

Theorem 1 proves that the generalization error of $h_{bpm}$ is at most $(e - 1) \approx 1.7$
times larger than the best possible. Note that this bound is dimension free. There
is no assumption on the size of the training sample $S$ or the way it was collected.
However, the size of $S$, the dimension and maybe other properties influence the
error of $h_{opt}$ and thus affect the performance of BPM.



Proof. If $\nu$ is log-concave, then any restriction of $\nu$ to a convex set is log-concave
as well. Since the version space is convex, the posterior induced by $S$ is log-concave.
Let $x \in \mathbb{R}^n$ be an instance for which the prediction is unknown. Let $H$
be the set of linear classifiers which predict that the label of $x$ is $+1$, therefore

$$H = \{\theta : \theta \cdot x > 0\}$$

and hence $H$ is a half-space. Algorithm $h_{opt}$ will predict that the label of $x$
is $+1$ iff $\nu(H|S) \ge 1/2$. W.l.o.g. assume that $\nu(H|S) \ge 1/2$. We consider two
cases.

First assume that $\nu(H|S) > 1 - 1/e$. From theorem 6 and the definition of
the depth function (1) it follows that any half space with measure $> 1 - 1/e$
must contain the center of gravity. Hence the prediction made by $h_{bpm}$ is the
same as the prediction made by $h_{opt}$.

The second case is when $1/2 \le \nu(H|S) \le 1 - 1/e$. If BPM predicts that the
label is $+1$, then it suffers from the same error as $h_{opt}$. If $h_{bpm}$ predicts that
the label of $x$ is $-1$ then:



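$$\frac{\Pr_c\left[h_{bpm}(x) \ne c(x) \mid S\right]}{\Pr_c\left[h_{opt}(x) \ne c(x) \mid S\right]} = \frac{\nu(H|S)}{1 - \nu(H|S)} \le \frac{1 - 1/e}{1/e} = e - 1.$$

Note that if $\nu(H|S) < 1/2$ the prediction of $h_{opt}$ will be that the label of $x$
is $-1$ and we can apply the same proof to

$$H = \{\theta : \theta \cdot x < 0\}.$$

□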
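
The two cases of the proof can be summarized as a function of $p = \nu(H|S) \ge 1/2$, the posterior mass of the half-space favoured by the Bayes classifier. The sketch below computes the largest possible ratio between the BPM error and the Bayes error at a single instance and confirms numerically that it never exceeds $e - 1$. The function name is hypothetical, and the boundary case $p = 1 - 1/e$ is treated as belonging to the second case.

```python
import math

def worst_case_error_ratio(p):
    """Largest possible ratio Pr[h_bpm errs] / Pr[h_opt errs] at one instance,
    where p in [1/2, 1) is the posterior mass of the majority half-space.

    If p > 1 - 1/e, the half-space must contain the center of gravity, so
    BPM agrees with the Bayes classifier and the ratio is 1. Otherwise BPM
    may predict the minority label, giving the ratio p / (1 - p).
    """
    if not 0.5 <= p < 1.0:
        raise ValueError("p must lie in [0.5, 1)")
    if p > 1.0 - 1.0 / math.e:
        return 1.0
    return p / (1.0 - p)

if __name__ == "__main__":
    grid = [0.5 + 0.001 * k for k in range(500)]
    print(max(worst_case_error_ratio(p) for p in grid), math.e - 1.0)
```

The maximum is approached as $p$ tends to $1 - 1/e$ from below, which is exactly where the constant $e - 1$ in Theorem 1 comes from.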
Fig. 1. Although the white point is close (distance wise) to the Tukey Median (in
black), it does not have large depth, as demonstrated by the dotted line.



2.2 Computational Complexity 

Theorem 1 provides a justification for the choice of the center of gravity in the 
Bayes Point Machine [1]. Herbrich et al. [1] suggested algorithms for approxi- 
mating the center of gravity. In order for our bounds to follow for the approx- 
imation, it is necessary to have some lower bound on the Tukey Depth of the 
approximating point. For this purpose, Euclidean proximity is not good enough
(see figure 1). Bertsimas and Vempala [10] have suggested a solution for this
problem. The algorithm they suggest requires 0*(n^) operations where n is the 
input dimension. However it is impractical due to large constants. Nevertheless, 
the research in this field is active and faster solutions may emerge. 



2.3 Mistake Bound 

The On-line Mistake-Bound model is another common framework in statistical 
learning. In this setting the learning is an iterative process, such that at iteration 
i, the student receives an instance Xi and has to predict the label yi. After 
making this prediction, the correct label is revealed. The goal of the student is 
to minimize the number of wrong predictions in the process. 

The following theorem proves that when learning linear classifiers in the on-line
model, if the student makes its predictions using the center of gravity of
the current version space, then the number of prediction mistakes is at most
$\frac{n}{-\log(1-1/e)} \log\frac{2R}{r}$, where $R$ is a radius of a ball containing all the instances and
$r$ is a margin term. Note that the perceptron algorithm has a bound of
$R^2/r^2$ in the same setting [6]. Hence the new bound is better when the dimension
$n$ is finite (i.e. small).

Theorem 2. Let $\{(x_i, y_i)\}_i \subset \mathbb{R}^n \times \{-1, 1\}$ be a sequence such that $\|x_i\|_2 \le R$
and there exists $r > 0$ and a unit vector $\theta \in \mathbb{R}^n$ such that $y_i\, x_i \cdot \theta \ge r$ for any $i$.
Let BPM be an algorithm that predicts the label of the next instance $x_{m+1}$ to be
the label assigned by the center of gravity of the intersection of the version space
induced by $\{(x_i,y_i)\}_{i=1}^m$ and the unit ball. The number of prediction mistakes
that BPM makes is at most $\frac{n}{-\log(1-1/e)} \log\frac{2R}{r}$.

² See appendix A for discussion and definitions of concave measures. Note however,
that the uniform distribution over the version space is always log-concave.

Proof. Recall that the version space is the set of all linear classifiers (inside the
unit ball) which correctly classify all instances seen so far. The outline of the proof
is as follows: first we will show that the volume of the version space is bounded
from below. Second, we will show that whenever a mistake occurs, the volume
of the version space reduces by a constant factor. Combining these two together,
we conclude that the number of mistakes is bounded.

Let $\theta$ be a unit vector such that $y_i\, x_i \cdot \theta \ge r$. Note that if $\|\theta' - \theta\|_2 < r/R$ then
$y_i\, x_i \cdot \theta' > 0$. Therefore, there exists a ball of radius $r/2R$ inside the unit ball of
$\mathbb{R}^n$ such that all $\theta'$ in this ball correctly classify all $x_i$'s. Hence, the volume of the
version space is at least $(r/2R)^n V_n$ where $V_n$ is the volume of the $n$-dimensional
unit ball.

Assume that BPM made a mistake while predicting the label of $x_i$. W.l.o.g.
assume that BPM predicted that the label is $+1$. Let $H = \{\theta : \theta \cdot x_i > 0\}$. Since
the center of gravity is in $H$, and the Tukey Depth of the center of gravity is $\ge 1/e$,
the volume of $H$ is at least $1/e$ of the volume of the version space. This is true
since the version space is convex and the uniform measure over convex bodies is
log-concave.

Therefore, whenever BPM makes a wrong prediction, the volume of the version
space reduces by a factor of $(1 - 1/e)$ at least. Assume that BPM made $k$
wrong predictions while processing the sequence $\{(x_i,y_i)\}_i$; then we have that
the volume of the version space is at most $(1 - 1/e)^k V_n$ and at least $(r/2R)^n V_n$, and
thus we conclude that

$$k \le \frac{n}{-\log(1 - 1/e)} \log\frac{2R}{r}.$$

□
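
To see when the logarithmic dependence on $R/r$ pays off, the following sketch evaluates the bound just derived next to Novikoff's $R^2/r^2$ bound for the perceptron. The function names and the numerical settings are illustrative assumptions.

```python
import math

def bpm_mistake_bound(n, R, r):
    """The bound k <= n * log(2R / r) / (-log(1 - 1/e)) from Theorem 2."""
    return n * math.log(2.0 * R / r) / (-math.log(1.0 - 1.0 / math.e))

def perceptron_mistake_bound(R, r):
    """Novikoff's dimension-independent bound R^2 / r^2."""
    return (R / r) ** 2

if __name__ == "__main__":
    for n, R, r in [(10, 1.0, 0.01), (1000, 1.0, 0.5)]:
        print(f"n={n}, R/r={R / r:g}: "
              f"BPM <= {bpm_mistake_bound(n, R, r):.1f}, "
              f"perceptron <= {perceptron_mistake_bound(R, r):.1f}")
```

With a small margin and a moderate dimension the BPM bound is far smaller; with a large margin and a high dimension the perceptron bound wins, matching the remark that the new bound is better when $n$ is small.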



3 The Bayes Depth 



As we saw in the previous section the Tukey Depth plays a key role in bounding 
the error of Bayes Point Machine when learning linear classifiers. We would 
like to extend these results beyond linear classifiers; thus we need to extend the 
notion of depth. Recall that the Tukey Depth (1) measures the centrality of a 
point with respect to a probability measure. We say that a point $x \in \mathbb{R}^n$ has
depth D = D{x) if when standing at x and looking in any direction, the points 
you will see have a probability measure of D at least. The question is thus how 
can we extend this definition to other classes? How should we deal with multi- 
class partitions of the data, relative to the binary partitions in the linear case? 
For this purpose we define Bayes Depth: 



Definition 2. Let $C$ be a concept class such that $c \in C$ is a function $c : \mathcal{X} \to \mathcal{Y}$.
Let $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be a loss function, and let $\nu$ be a probability measure over $C$.
The Bayes Depth of a hypothesis $h$ is

$$D_{Bayes}(h) = \inf_{x \in \mathcal{X}} \frac{\min_{y \in \mathcal{Y}} E_{c \sim \nu}\left[l(y, c(x))\right]}{E_{c \sim \nu}\left[l(h(x), c(x))\right]} \qquad (4)$$

The denominator in (4) is the expected loss of $h$ when predicting the class of
$x$, while the numerator is the minimal possible expected loss, i.e. the loss of the
Bayes classifier. Note that the hypothesis $h$ need not be a member of the concept
class $C$. Furthermore, it need not be a deterministic function; if $h$ is stochastic
then the loss of $h$ should be averaged over its internal randomness.

An alternative definition of depth is provided implicitly in definition 1. Recall 
that Herbrich et al. [1] defined the Bayes Point h as the point which minimizes 
the term 

$$E_x\left[E_{c \sim \nu}\left[l(h(x), c(x))\right]\right] \qquad (5)$$

when I is some loss function. Indeed the concept which minimizes the term in (5) 
is the concept with minimal average loss, and thus this is a good candidate for 
a depth function. However, evaluating this term requires full knowledge of the 
distribution of the sample points. This is usually unknown and in some cases it 
does not exist since the sample point might be chosen by an adversary. 

3.1 Examples 

Before going any further we would like to look at a few examples which demon- 
strate the definition of Bayes Depth. 

Example 1. Bayesian prediction rule 

Let $h$ be the Bayesian prediction rule, i.e. $h(x) = \arg\min_{y \in Y} E_{c' \sim \nu}[l(y, c'(x))]$.
It follows from the definition of depth that $D_{\mathrm{Bayes}}(h) = 1$. Note that no
prediction rule can have a depth greater than 1.



Example 2. MAP on finite concept classes 

Let $C$ be a finite concept class of binary classifiers and let $l$ be the zero-one
loss function. Let $h = \arg\max_{c \in C} \nu(c)$, i.e. $h$ is the Maximum A-Posteriori hypothesis. Since
$C$ is finite we obtain $\nu(h) \ge 1/|C|$. Simple algebra yields $D_{\mathrm{Bayes}}(h) \ge 1/(|C| - 1)$.



Example 3. Center of Gravity 

In this example we go back to linear classifiers. The sample space consists of
tuples $(x, b)$ such that $x \in \mathbb{R}^n$ and $b \in \mathbb{R}$. A classifier is a vector $w \in \mathbb{R}^n$ such
that the label $w$ assigns to $(x, b)$ is $\mathrm{sign}(w \cdot x + b)$. The loss is the zero-one loss as
before. Unlike the standard setting of linear classifiers the offset $b$ is part of the
sample space and not part of the classifier. This setting has already been used
in [11]. In this case the Bayes Depth is a normalized version of the Tukey Depth:

$$D_{\mathrm{Bayes}}(w) = \frac{D(w)}{1 - D(w)}$$



Example 4. Gibbs Sampling

Our last example uses the Gibbs prediction rule which is a stochastic rule.
This rule selects at random $c \in C$ according to $\nu$ and uses it to predict the
label of $x$. Note that Haussler et al. [12] already analyzed this special case
using different notation. Let $h$ be the Gibbs stochastic prediction rule such
that $\Pr[h(x) = y] = \nu\{c : c(x) = y\}$. Let $l$ be the zero-one loss function. Assume
that $Y = \{-1, +1\}$, and denote $p = \nu\{c : c(x) = +1\}$. We obtain

$$D_{\mathrm{Bayes}}(h) \ge \inf_{p \in (0,1)} \frac{\min(p, 1-p)}{2p(1-p)} = 0.5.$$
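
The value 0.5 in the Gibbs example can be checked numerically. The sketch below assumes the zero-one loss and binary labels as in the example; the helper name `gibbs_depth_ratio` is illustrative. The Bayes rule errs with probability min(p, 1-p), while the Gibbs rule disagrees with an independently drawn concept with probability 2p(1-p).

```python
def gibbs_depth_ratio(p: float) -> float:
    """Ratio of Bayes-optimal loss to Gibbs loss at a point where a fraction p
    of the measure predicts +1 (zero-one loss, binary labels)."""
    bayes_loss = min(p, 1.0 - p)
    gibbs_loss = 2.0 * p * (1.0 - p)
    return bayes_loss / gibbs_loss


# Scan p over a grid in (0, 1); the ratio equals 1 / (2 * max(p, 1 - p)).
ratios = [gibbs_depth_ratio(k / 1000) for k in range(1, 1000)]
worst = min(ratios)
```

The ratio stays strictly above 0.5 on the open interval and approaches 0.5 only as p tends to 0 or 1, which is why the infimum is not attained.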

3.2 Generalization Bounds 

Theorems 1 and 2 are special cases of a general principle. In this section we show 
that a “deep” classifier, i.e. a classifier with large Bayes Depth, generalizes well. 
We will see that both the generalization error, in the batch framework, and the 
mistake bound, in the online framework, can be bounded in terms of the Bayes 
Depth. 

Theorem 3. Let $C$ be a parameter space and let $\nu$ be a probability measure
(prior or posterior) over $C$ and $l$ be a loss function. Let $h$ be a classifier; then for
any probability measure over $X$

$$E_{c \sim \nu} E_x\left[l(h(x), c(x))\right] \le \frac{1}{D_{\mathrm{Bayes}}(h)} E_{c \sim \nu} E_x\left[l(h_{\mathrm{opt}}(x), c(x))\right] \tag{6}$$

where $h_{\mathrm{opt}}(\cdot)$ is the optimal predictor, i.e. the Bayes prediction rule.

The generalization bound presented in (6) differs from the common PAC
bounds (e.g. [13,14, ...]). The common bounds provide a bound on the general- 
ization error based on the empirical error. (6) gives a multiplicative bound on 
the ratio between the generalization error and the best possible generalization 
error. A similar approach was used by Haussler et al. [12]. They proved that the 
generalization error of the Gibbs sampler is at most twice as large as the best 
possible. 

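Proof. Let $x \in X$ and let $D = D_{\mathrm{Bayes}}(h)$ be the depth of $h$. Thus,

$$D \le \frac{\min_{y \in Y} E_{c' \sim \nu}\left[l(y, c'(x))\right]}{E_{c' \sim \nu}\left[l(h(x), c'(x))\right]}$$

Therefore,

$$E_{c' \sim \nu}\left[l(h(x), c'(x))\right] \le \frac{1}{D} \min_{y \in Y} E_{c' \sim \nu}\left[l(y, c'(x))\right] = \frac{1}{D} E_{c' \sim \nu}\left[l(h_{\mathrm{opt}}(x), c'(x))\right] \tag{7}$$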

Averaging (7) over x we obtain the stated result. □ 

We now turn to prove the extended version of theorem 2, which deals with 
the online setting. This analysis resembles the analysis of the Halving algorithm 
[15]. However, the algorithm presented avoids the computational deficiencies of 
the Halving algorithm. 

Theorem 4. Let $\{(x_i, y_i)\}_{i=1}^{\infty}$ be a sequence of labeled instances where $x_i \in X$
and $y_i \in \{\pm 1\}$. Assume that there exists a probability measure $\nu$ over a
concept class $C$ such that $\nu\{c \in C : \forall i\ c(x_i) = y_i\} \ge \gamma > 0$. Let $L$ be a learning
algorithm such that given a training set $S = \{(x_i, y_i)\}_{i=1}^{m}$, $L$ returns a hypothesis
$h$ which is consistent with $S$ and such that $D_{\mathrm{Bayes}}(h) \ge D_0 > 0$ (with respect
to the measure $\nu$ restricted to the version-space and the zero-one loss). Then the
algorithm which predicts the label of a new instance using the hypothesis returned
by $L$ on the data seen so far will make at most

$$\frac{\log(1/\gamma)}{\log(1 + D_0)}$$



mistakes. 
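
To get a feel for the size of this bound, it can be evaluated for a few depths. The sketch below is only an illustration: the function name `mistake_bound` is hypothetical, and the depth value 1/(e-1) is the normalized depth D/(1-D) obtained by assuming a Tukey Depth of D = 1/e.

```python
import math


def mistake_bound(gamma: float, d0: float) -> float:
    """Evaluate log(1/gamma) / log(1 + d0), the mistake bound of Theorem 4."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    if not 0.0 < d0 <= 1.0:
        raise ValueError("d0 must lie in (0, 1]")
    return math.log(1.0 / gamma) / math.log(1.0 + d0)


# Depth 1 (always predicting with the majority) recovers the Halving bound log2(1/gamma).
halving_like = mistake_bound(1.0 / 1024.0, 1.0)

# Assumed depth 1/(e-1): the constant multiplying ln(1/gamma) is about 2.18.
constant = mistake_bound(math.exp(-1.0), 1.0 / (math.e - 1.0))
```

With depth 1 the bound coincides with the Halving algorithm's, and smaller depths inflate it only through the factor 1/log(1 + D_0).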

Proof. Assume that the algorithm presented made a mistake in predicting the
label of $x_m$. Denote by $V_{m-1}$ the version space at this stage; then

$$V_{m-1} = \{c \in C : \forall\, 1 \le i < m,\ c(x_i) = y_i\}$$

From the definition of the version space and the assumptions of this theorem we
have that $\nu(V_{m-1}) \ge \gamma$. We will consider two cases. One is when the majority
of the classifiers misclassify $x_m$, and the second is when only the minority
misclassifies. If the majority made a mistake then $\nu(V_m) \le \frac{1}{2}\nu(V_{m-1})$.

However if the minority made a mistake, the hypothesis $h$ returned by $L$ is
in the minority, but since $D_{\mathrm{Bayes}}(h) \ge D_0$ we obtain

$$D_0 \le \frac{\nu\{c \in V_{m-1} : c(x_m) = -y_m\}}{\nu\{c \in V_{m-1} : c(x_m) = y_m\}} \tag{8}$$

Note that the denominator in (8) is merely $\nu(V_m)$ while the numerator is
$\nu(V_{m-1}) - \nu(V_m)$. Thus

$$D_0 \le \frac{\nu(V_{m-1}) - \nu(V_m)}{\nu(V_m)} = \frac{\nu(V_{m-1})}{\nu(V_m)} - 1$$

and thus $\nu(V_m) \le \nu(V_{m-1}) / (1 + D_0)$.

If there were $k$ wrong predictions on the labels of $x_1, \ldots, x_m$ then

$$\nu(V_m) \le \left(\max\left\{\tfrac{1}{2}, \tfrac{1}{1 + D_0}\right\}\right)^k$$

while $\gamma \le \nu(V_m)$ and thus, since $D_0$ is upper bounded by 1, we conclude

$$k \le \frac{\log(1/\gamma)}{\log(1 + D_0)}$$



□ 

4 Concentration of Measure for Multivariate Random
Variables to the Tukey Median

In previous sections we have seen the significance of the Tukey Depth [2] in prov- 
ing generalization bounds. Inspired by this definition we also used the extended 
Bayes Depth to prove generalization bounds on general concept classes and loss 
functions. However, the Tukey Depth has many other interesting properties. For 
example, Donoho and Gasko [7] defined the Tukey Median as the point which 
achieves the best Tukey Depth. They showed that such a point always exists, but 
it need not be unique. The Tukey Median has high breakdown point [7] which 
means that it is resistant to outliers, much like the univariate median. 

In this section we use Tukey Depth to provide a novel concentration of measure
inequality for multivariate random variables. The theorem states that any
function from a product space to $\mathbb{R}^n$ that is Lipschitz (in Talagrand's sense) is
concentrated around its Tukey Median.

Theorem 5. Let $\Omega_1, \ldots, \Omega_d$ be measurable spaces and let $X = \Omega_1 \times \ldots \times \Omega_d$
be the product space with $P$ being a product measure. Let $F : X \to \mathbb{R}^n$ be a
multivariate random variable such that $F$ is a Lipschitz function in the sense
that for any $x \in X$ there exists $a = a(x) \in \mathbb{R}^d_+$ with $\|a\|_2 = 1$ such that for every
$y \in X$

$$\|F(x) - F(y)\| \le \sum_{i : x_i \ne y_i} a_i \tag{9}$$

Assume furthermore that $F$ is bounded such that $\|F(x) - F(y)\| \le M$.

Let $z \in \mathbb{R}^n$; then for any $r > 0$

$$P_x\left[\|F(x) - z\| \ge r\right] \le \frac{1}{D(z)} \left(\frac{4M}{r}\right)^n e^{-r^2/16} \tag{10}$$

where $D(z)$ is the Tukey Depth of $z$ with respect to the push forward measure
induced by $F$.

Proof. Let $w \in \mathbb{R}^n$ be in the unit ball. From (9), it follows that if $a = a(x)$ then
for any $y \in X$

$$F(x) \cdot w - F(y) \cdot w = (F(x) - F(y)) \cdot w \le \|F(x) - F(y)\| \|w\| \le \sum_{i : x_i \ne y_i} a_i$$

which means that the functional $x \mapsto F(x) \cdot w$ is Lipschitz. Let $z \in \mathbb{R}^n$; then
$\Pr_{x \sim P}[F(x) \cdot w \le z \cdot w] \ge D(z)$. Using Talagrand's theorem [16] we conclude
that

$$\Pr_{x \sim P}\left[F(x) \cdot w \ge z \cdot w + r/2\right] \le \frac{e^{-r^2/16}}{D(z)}$$

Clearly this will hold for any vector $w$ such that $\|w\| \le 1$.

Footnote: Lipschitz is in Talagrand's sense. See e.g. [9, pp. 72-79].

Let $W$ be a minimal $r/2M$ covering of the unit sphere in $\mathbb{R}^n$, i.e. for any unit
vector $u$ there exists $w \in W$ such that $\|u - w\| \le r/2M$. W.l.o.g. $W$ is a subset
of the unit ball, otherwise project all the points in $W$ onto the unit ball. Since
$W$ is minimal, $|W| \le (4M/r)^n$. Using the union bound over all $w \in W$ it
follows that

$$\Pr_{x \sim P}\left[\exists w \in W,\ F(x) \cdot w \ge z \cdot w + r/2\right] \le \left(\frac{4M}{r}\right)^n \frac{e^{-r^2/16}}{D(z)}$$

Finally we claim that if $x$ is such that $\|F(x) - z\| \ge r$ then there exists
$w \in W$ such that $F(x) \cdot w \ge z \cdot w + r/2$. For this purpose we assume that
$z \in \mathrm{conv}(F(X))$; otherwise the statement is trivial since $D(z) = 0$. Let

$$u = \frac{F(x) - z}{\|F(x) - z\|}$$

then $u$ is a unit vector and

$$F(x) \cdot u - z \cdot u = (F(x) - z) \cdot u = \|F(x) - z\| \ge r$$

Since $W$ is a cover of the unit sphere and $u$ is a unit vector, there exists $w \in W$
such that $\|w - u\| \le r/2M$. Then

$$F(x) \cdot w - z \cdot w = (F(x) - z) \cdot w$$
$$= (F(x) - z) \cdot u + (F(x) - z) \cdot (w - u)$$
$$\ge r - \|F(x) - z\| \|w - u\|$$
$$\ge r - (M)(r/2M)$$
$$= r/2$$

and thus $F(x) \cdot w \ge z \cdot w + r/2$. Hence,

$$\Pr_x\left[\|F(x) - z\| \ge r\right] \le \Pr_x\left[\exists w \in W,\ F(x) \cdot w \ge z \cdot w + r/2\right] \le \left(\frac{4M}{r}\right)^n \frac{e^{-r^2/16}}{D(z)}$$



□ 



Corollary 1. In the setting of theorem 5, if $m_F$ is the Tukey Median of $F$, i.e.
the Tukey Median of the push-forward measure induced by $F$, then for any $r > 0$

$$P_x\left[\|F(x) - m_F\| \ge r\right] \le (n + 1) \left(\frac{4M}{r}\right)^n e^{-r^2/16}$$

Proof. From Helly's theorem [3] it follows that $D(m_F) \ge 1/(n+1)$ for any
measure on $\mathbb{R}^n$. Substitute this in (10) to obtain the stated result. □

Note also that any Lipschitz function is bounded since

$$\|F(x) - F(y)\| \le \sum_{i : x_i \ne y_i} a_i \le \sqrt{d}$$

hence $M$ in the above results is bounded by $\sqrt{d}$.
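
The tail bound is informative only when $r$ is large enough for the exponential factor to beat the covering-number factor. The sketch below assumes the bound takes the form $(n+1)(4M/r)^n e^{-r^2/16}$, which is what the covering size $(4M/r)^n$, the Talagrand factor with deviation $r/2$, and $D(m_F) \ge 1/(n+1)$ from the proof combine to; the function name and the sample values are illustrative.

```python
import math


def tukey_median_tail_bound(n: int, M: float, r: float) -> float:
    """Evaluate (n + 1) * (4M / r)^n * exp(-r^2 / 16); values above 1 are vacuous."""
    if n < 1 or M <= 0.0 or r <= 0.0:
        raise ValueError("need n >= 1, M > 0 and r > 0")
    return (n + 1) * (4.0 * M / r) ** n * math.exp(-r * r / 16.0)


# Small radius: the bound exceeds 1 and says nothing.
vacuous = tukey_median_tail_bound(n=1, M=1.0, r=1.0)

# Larger radius (still within the diameter M): the bound becomes meaningful.
useful = tukey_median_tail_bound(n=1, M=10.0, r=10.0)
```

Because the covering factor grows exponentially in the output dimension $n$, the radius at which the bound drops below 1 grows with $n$ as well.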

Fig. 2. A comparison of the Tukey Median (in black) and the maximal margin point
(in white). In this case, the maximal margin point has small Tukey Depth 



5 Summary and Discussion 

In this paper we present new generalization bounds for Bayes Point Machines
[1]. These bounds apply the mean voter theorem [4] to show that the generalization
error of Bayes Point Machines is greater than the minimal possible error
by at most a factor of $(e - 1) \approx 1.71$. We also provide a new on-line mistake
bound of $-\frac{n}{\log(1 - 1/e)} \log(2R/r) \approx 2.18\, n \ln(2R/r)$ for this algorithm.

The notion of Bayes Point is extended beyond linear classifiers to a general 
concept class. We defined the Bayes Depth in the general supervised learning 
context, as an extension of the familiar Tukey Depth. We give examples for calcu- 
lating the Bayes Depth and provide a generalization bound which is applicable 
to this more general setting. Our bounds hold for multi-class problems and for 
any loss function. 

Finally we provide a concentration of measure inequality for multivariate 
random variables to their Tukey Median. This inequality suggests that the cen- 
ter of gravity is indeed a good approximation to the Bayes Point. This provides 
additional evidence for the fitness of the Tukey Median as the multivariate gen- 
eralization of the scalar median (see also [17] for a discussion on this issue). 

The nature of the generalization bounds presented in this paper is different 
from the more standard bounds in machine learning. Here we bound the multi- 
plicative difference between the learned classifier and the optimal Bayes classifier. 
This multiplicative factor is a measure of the efficiency of the learning algorithm 
to exploit the available information. On the other hand, the more standard PAC- 
like bounds [13,14, ...], provide an additive bound, on the difference between the 
training error and the generalization error, with high confidence. The advantage
of additive bounds is in their performance guarantee. Nevertheless, empirically it
is known that PAC bounds are very loose due to their worst case distributional
assumptions. The multiplicative bounds are tighter than the additive ones in 
these cases. 

The bounds for linear Bayes Point Machines and the use of Tukey Depth 
can provide another explanation for the success of Support Vector Machines [5] . 
Although the depth of the maximal margin classifier can be arbitrarily small 
(see figure 2), if the version space is “round” the maximal margin point is close 
to the Tukey Median. We argue that in many cases this is indeed the case. 

There seems to be a deep relationship between Tukey Depth and Active 
Learning, especially through the Query By Committee (QBC) algorithm [11]. 
The concept of information gain, as used by Freund et al. [11] to analyze the
QBC algorithm, is very similar to Tukey Depth. This and other extensions are 
left for further research. 

Acknowledgments. We thank Ran El-Yaniv, Amir Globerson and Nati Linial 

A Concave Measures

We provide a brief introduction to concave measures. See [8,4,18,19] for more 
information about log-concavity and log-concave measures. 

Definition 3. A probability measure $\nu$ over $\mathbb{R}^n$ is said to be log-concave if for
any measurable sets $A$ and $B$ and every $0 \le \lambda \le 1$ the following holds:

$$\nu(\lambda A + (1 - \lambda) B) \ge \nu(A)^{\lambda}\, \nu(B)^{1 - \lambda}$$

Note that many common probability measures are log-concave, for example 
uniform measures over compact convex sets, normal distributions, chi-square 
and more. Moreover the restriction of any log-concave measure to a convex set 
is a log-concave measure. 

In some cases, there is a need to quantify concavity. The following definition 
provides such a quantifier. 

Definition 4. A probability measure $\nu$ over $\mathbb{R}^n$ is said to be p-concave if for
any measurable sets $A$ and $B$ and every $0 \le \lambda \le 1$ the following holds:

$$\nu(\lambda A + (1 - \lambda) B) \ge \left[\lambda\, \nu(A)^p + (1 - \lambda)\, \nu(B)^p\right]^{1/p}$$

A few facts about p-concave measures:

- If $\nu$ is p-concave with $p = \infty$ then $\nu(\lambda A + (1 - \lambda)B) \ge \max(\nu(A), \nu(B))$.

- If $\nu$ is p-concave with $p = -\infty$ then $\nu(\lambda A + (1 - \lambda)B) \ge \min(\nu(A), \nu(B))$.

- If $\nu$ is p-concave with $p = 0$ then $\nu(\lambda A + (1 - \lambda)B) \ge \nu(A)^{\lambda} \nu(B)^{1 - \lambda}$; in this
case $\nu$ is called log-concave.



B Mean Voter Theorem 



Caplin and Nalebuff [4] proved the Mean Voter Theorem in the context of the 
voting problem. They did not phrase their theorem using Tukey Depth but the 
translation is trivial. Hence, we provide here (without proof) a rephrased version 
of their theorem. 



Theorem 6. (Caplin and Nalebuff) Let $\nu$ be a p-concave measure over $\mathbb{R}^n$ with
$p \ge -1/(n + 1)$. Let $z$ be the center of gravity of $\nu$, i.e. $z = E_{x \sim \nu}[x]$. Then

$$D(z) \ge \left(\frac{n + 1/p}{n + 1 + 1/p}\right)^{n + 1/p} \tag{11}$$

where $D(\cdot)$ is the Tukey Depth.

First note that when $p \to 0$ the bound in (11) approaches $1/e$; hence for log-concave
measures $D(z) \ge 1/e$. However, this bound is better than $1/e$ in many
cases, i.e. when $p > 0$. This fact can be used to obtain an improved version of
theorems 1 and 2.
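
The behaviour of the bound in (11) as $p$ varies is easy to tabulate. The sketch below is illustrative only: the function name is hypothetical, $p = 0$ is handled by substituting the limiting value $1/e$ stated above, and the boundary case $p = -1/(n+1)$ is excluded because the expression is not defined there.

```python
import math


def mean_voter_depth_bound(n: int, p: float) -> float:
    """Lower bound on the Tukey Depth of the center of gravity of a p-concave
    measure on R^n: ((n + 1/p) / (n + 1 + 1/p)) ** (n + 1/p)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if p == 0.0:
        return math.exp(-1.0)  # limiting value for log-concave measures
    if p <= -1.0 / (n + 1):
        raise ValueError("this sketch requires p > -1/(n+1)")
    exponent = n + 1.0 / p
    return (exponent / (exponent + 1.0)) ** exponent


bound_p1 = mean_voter_depth_bound(2, 1.0)      # (3/4)^3
bound_small_p = mean_voter_depth_bound(3, 1e-8)  # close to 1/e
```

For $n = 2$ and $p = 1$ the bound is $27/64 \approx 0.42$, noticeably above $1/e \approx 0.37$.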

Abstract. One of the nice properties of kernel classifiers such as SVMs
is that they often produce sparse solutions. However, the decision func- 
tions of these classifiers cannot always be used to estimate the condi- 
tional probability of the class label. We investigate the relationship be- 
tween these two properties and show that these are intimately related: 
sparseness does not occur when the conditional probabilities can be un- 
ambiguously estimated. We consider a family of convex loss functions and 
derive sharp asymptotic bounds for the number of support vectors. This 
enables us to characterize the exact trade-off between sparseness and the 
ability to estimate conditional probabilities for these loss functions. 



1 Introduction 

Consider the following familiar setting of a binary classification problem. A sequence
$T = ((x_1, y_1), \ldots, (x_n, y_n))$ of i.i.d. pairs is drawn from a probability
distribution over $X \times Y$ where $X \subset \mathbb{R}^d$ and $Y$ is the set of labels (which we
assume is $\{+1, -1\}$ for convenience). The goal is to use the training set $T$ to
predict the label of a new observation $x \in X$. A common way to approach the
problem is to use the training set to construct a decision function $f_T : X \to \mathbb{R}$
and output $\mathrm{sign}(f_T(x))$ as the predicted label of $x$.

In this paper, we consider classifiers based on an optimization problem of the
form:

$$f_{T,\lambda} = \arg\min_{f \in H}\ \lambda \|f\|_H^2 + \frac{1}{n} \sum_{i=1}^{n} \phi(y_i f(x_i)) \tag{1}$$

Here, $H$ is a reproducing kernel Hilbert space (RKHS) of some kernel $k$, $\lambda > 0$
is a regularization parameter and $\phi : \mathbb{R} \to [0, \infty)$ is a convex loss function. Since
optimization problems based on the non-convex 0-1 loss $t \mapsto I(t \le 0)$
(where $I(\cdot)$ is the indicator function) are computationally intractable, use of convex
loss functions is often seen as using upper bounds on the 0-1 loss to make
the problem computationally easier. Although computational tractability is one 
of the goals we have in mind while designing classifiers, it is not the only one. 
We would like to compare different convex loss functions based on their statistical
and other useful properties. Conditions ensuring Bayes-risk consistency of
classifiers using convex loss functions have already been established [2,4,9,12]. It 
has been observed that different cost functions have different properties and it is 
important to choose a loss function judiciously (see, for example, [10]). In order 
to understand the relative merits of different loss functions, it is important to 
consider these properties and investigate the extent to which different loss func- 
tions exhibit them. It may turn out (as it does below) that different properties 
are in conflict with each other. In that case, knowing the trade-off allows one 
to make an informed choice while choosing a loss function for the classification 
task at hand. 

One of the properties we focus on is the ability to estimate the conditional
probability of the class label $\eta(x) = P(Y = +1 \mid X = x)$. Under some conditions
on the loss function and the sequence of regularization parameters $\lambda_n$, the
solutions of (1) converge (in probability) to a function $F^*_\phi(\eta(x))$ which is set
valued in general [7]. As long as we can uniquely identify $\eta(x)$ based on a value
in $F^*_\phi(\eta(x))$, we can hope to estimate conditional probabilities using $f_{T,\lambda_n}(x)$, at
least asymptotically. Choice of the loss function is crucial to this property. For
example, the L2-SVM (which uses the loss function $t \mapsto (\max\{0, 1 - t\})^2$) is much
better than the L1-SVM (which uses $t \mapsto \max\{0, 1 - t\}$) in terms of asymptotically
estimating conditional probabilities.

Another criterion is the sparseness of solutions of (1). It is well known that
any solution $f_{T,\lambda}$ of (1) can be represented as

$$f_{T,\lambda}(x) = \sum_{i=1}^{n} \alpha_i^* k(x, x_i) \tag{2}$$

The observations $x_i$ for which the coefficients $\alpha_i^*$ are non-zero are called support
vectors. The rest of the observations have no effect on the value of the decision
function. Having fewer support vectors leads to faster evaluation of the decision
function. Bounds on the number of support vectors are therefore useful to know.
Steinwart's recent work [8] has shown that for the L1-SVM and a suitable kernel,
the asymptotic fraction of support vectors is twice the Bayes-risk. Thus, L1-SVMs
can be expected to produce sparse solutions. It was also shown that L2-SVMs
will typically not produce sparse solutions.

We are interested in how sparseness relates to the ability to estimate conditional
probabilities. What we mentioned about L1 and L2-SVMs leads to several
questions. Do we always lose sparseness by being able to estimate conditional
probabilities? Is it possible to characterize the exact trade-off between the asymptotic
fraction of support vectors and the ability to estimate conditional probabilities?
If sparseness is indeed lost when we are able to fully estimate conditional
probabilities, we may want to estimate conditional probabilities only in an interval,
say (0.05, 0.95), if that helps recover sparseness. Estimating $\eta$ for $x$'s that
have $\eta(x) \ge 0.95$ may not be too crucial for our prediction task. How can we
design loss functions which enable us to estimate probabilities in sub-intervals
of [0, 1] while preserving as much sparseness as possible?

This paper attempts to answer these questions. We show that if one wants to
estimate conditional probabilities in an interval $(\gamma, 1 - \gamma)$ for some $\gamma \in (0, 1/2)$,
then sparseness is lost on that interval in the sense that the asymptotic fraction of
data that become support vectors is lower bounded by $E_x G(\eta(x))$ where $G(\eta) = 1$
throughout the interval $(\gamma, 1 - \gamma)$. Moreover, one cannot recover sparseness by
giving up the ability to estimate conditional probabilities in some sub-interval
of $(\gamma, 1 - \gamma)$. The only way to do that is to increase $\gamma$, thereby shortening the
interval $(\gamma, 1 - \gamma)$. We also derive sharp bounds on the asymptotic number of
support vectors for a family of loss functions of the form:

$$\phi(t) = h((t_0 - t)_+), \quad t_0 > 0$$

where $t_+$ denotes $\max\{0, t\}$ and $h$ is a continuously differentiable convex function
such that $h'(0) > 0$. Each loss function in the family allows one to estimate
probabilities in the interval $(\gamma, 1 - \gamma)$ for some value of $\gamma$. The asymptotic fraction
of support vectors is then $E_x G(\eta(x))$, where $G(\eta)$ is a function that increases
linearly from 0 to 1 as $\eta$ goes from 0 to $\gamma$. For example, if
$\phi(t) = \frac{1}{3}((1 - t)_+)^2 + \frac{2}{3}(1 - t)_+$ then conditional probabilities can be estimated in (1/4, 3/4)
and $G(\eta) = 1$ for $\eta \in (1/4, 3/4)$ (see Fig. 1).





Fig. 1. Plots of $F^*_\phi(\eta)$ (left) and $G(\eta)$ (right) for a loss function which is a convex
combination of the L1 and L2-SVM loss functions. Dashed lines represent the corresponding
plots for the original loss functions.



2 Notation and Known Results 

Let $P$ be the probability distribution over $X \times Y$ and let $T \in (X \times Y)^n$ be a
training set. Let $E_P(\cdot)$ denote expectations taken with respect to the distribution
$P$. Similarly, let $E_x(\cdot)$ denote expectations taken with respect to the marginal
distribution on $X$. Let $\eta(x)$ be $P(Y = +1 \mid X = x)$. For a decision function
$f : X \to \mathbb{R}$, define its risk as

$$R_P(f) = E_P\, I(y f(x) \le 0) .$$

The Bayes-risk $R^*_P = \inf\{R_P(f) : f \text{ measurable}\}$ is the least possible risk. Given
a loss function $\phi$, define the $\phi$-risk of $f$ by

$$R_{\phi,P}(f) = E_P\, \phi(y f(x)) .$$

The optimal $\phi$-risk $R^*_{\phi,P} = \inf\{R_{\phi,P}(f) : f \text{ measurable}\}$ is the least achievable
$\phi$-risk. When the expectations in the definitions of $R_P(f)$ and $R_{\phi,P}(f)$ are taken
with respect to the empirical measure corresponding to $T$, we get the empirical
risk $R_T(f)$ and the empirical $\phi$-risk $R_{\phi,T}(f)$ respectively. Conditioning on $x$, we
can write the $\phi$-risk as

$$R_{\phi,P}(f) = E_x\left[E(\phi(y f(x)) \mid x)\right]$$
$$= E_x\left[\eta(x)\phi(f(x)) + (1 - \eta(x))\phi(-f(x))\right]$$
$$= E_x\left[C(\eta(x), f(x))\right] .$$

Here, we have defined $C(\eta, t) = \eta\phi(t) + (1 - \eta)\phi(-t)$. To minimize the $\phi$-risk,
we have to minimize $C(\eta, \cdot)$ for each $\eta \in [0, 1]$. So, define the set valued function
$F^*_\phi(\eta)$ by

$$F^*_\phi(\eta) = \{t \in \bar{\mathbb{R}} : C(\eta, t) = \min_{s \in \bar{\mathbb{R}}} C(\eta, s)\}$$

where $\bar{\mathbb{R}}$ is the set of extended reals $\mathbb{R} \cup \{-\infty, \infty\}$. Any measurable selection $f^*$
of $F^*_\phi$ actually minimizes the $\phi$-risk. The function $F^*_\phi$ is plotted for three choices
of $\phi$ in Fig. 1. From the definitions of $C(\eta, t)$ and $F^*_\phi(\eta)$, it is easy to see that
$F^*_\phi(\eta) = -F^*_\phi(1 - \eta)$. Steinwart [7] also proves that $\eta \mapsto F^*_\phi(\eta)$ is a monotone
operator. This means that if $\eta_1 > \eta_2$, $t_1 \in F^*_\phi(\eta_1)$ and $t_2 \in F^*_\phi(\eta_2)$ then $t_1 \ge t_2$.
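
The minimizers of $C(\eta, \cdot)$ can be located numerically, which makes the difference between the hinge and squared hinge losses concrete. The sketch below uses a plain grid search over $[-2, 2]$; the function names, the grid, and its resolution are illustrative assumptions, and the search returns a single minimizer rather than the whole set $F^*_\phi(\eta)$.

```python
def hinge(t: float) -> float:
    return max(0.0, 1.0 - t)


def squared_hinge(t: float) -> float:
    return max(0.0, 1.0 - t) ** 2


def conditional_risk(loss, eta: float, t: float) -> float:
    """C(eta, t) = eta * loss(t) + (1 - eta) * loss(-t)."""
    return eta * loss(t) + (1.0 - eta) * loss(-t)


def grid_minimizer(loss, eta: float, lo: float = -2.0, hi: float = 2.0,
                   steps: int = 4000) -> float:
    """Return one grid point minimizing C(eta, .) on [lo, hi]."""
    ts = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(ts, key=lambda t: conditional_risk(loss, eta, t))


# Hinge loss: the minimizer is 1 for both values of eta, so eta cannot be read off.
hinge_06 = grid_minimizer(hinge, 0.6)
hinge_08 = grid_minimizer(hinge, 0.8)

# Squared hinge: the minimizer is 2 * eta - 1, which identifies eta.
sq_06 = grid_minimizer(squared_hinge, 0.6)
sq_08 = grid_minimizer(squared_hinge, 0.8)
```

For the squared hinge the map from $\eta$ to the minimizer is invertible on (0, 1), whereas for the hinge it is constant on each side of 1/2.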
A convex loss function is called classification calibrated if the following two
conditions hold:

$$\eta < \tfrac{1}{2} \Rightarrow F^*_\phi(\eta) \subset [-\infty, 0) \quad \text{and} \quad \eta > \tfrac{1}{2} \Rightarrow F^*_\phi(\eta) \subset (0, +\infty] .$$

A necessary and sufficient condition for a convex $\phi$ to be classification calibrated
is that $\phi'(0)$ exists and is negative [2]. If $\phi$ is classification calibrated then it
is guaranteed that for any sequence $f_n$ such that $R_{\phi,P}(f_n) \to R^*_{\phi,P}$, we have
$R_P(f_n) \to R^*_P$. Thus, classification calibrated loss functions are good in the
sense that minimizing the $\phi$-risk leads to classifiers that have risks approaching
the Bayes-risk. Note, however, that in the optimization problem (1), we are
minimizing the regularized $\phi$-risk

$$R^{\mathrm{reg}}_{\phi,T,\lambda}(f) = \lambda \|f\|_H^2 + R_{\phi,T}(f) .$$

Steinwart [9] has shown that if one uses a classification calibrated convex loss
function, a universal kernel (one whose RKHS is dense in the space of continuous
functions over $X$) and a sequence of regularization parameters such that $\lambda_n \to 0$
sufficiently slowly, then $R_{\phi,P}(f_{T,\lambda_n}) \to R^*_{\phi,P}$. In another paper [7], he proves that
this is sufficient to ensure the convergence in probability of $f_{T,\lambda_n}$ to $F^*_\phi(\eta(\cdot))$.
That is, for all $\epsilon > 0$

$$P_x\left(\{x \in X : \rho(f_{T,\lambda_n}(x), F^*_\phi(\eta(x))) \ge \epsilon\}\right) \to 0 \tag{3}$$

The function $\rho(t, B)$ is just the distance from $t$ to the point in $B$ which is closest
to $t$. The definition given by Steinwart [7] is more complicated because one has
to handle the case when $B \cap \mathbb{R} = \emptyset$. We will ensure in our proofs that $F^*_\phi$ is not
a singleton set just containing $+\infty$ or $-\infty$.

Since $f_{T,\lambda_n}$ converges to $F^*_\phi(\eta(\cdot))$, the plots in Fig. 1 suggest that the L2-SVM
decision function can be used to estimate conditional probabilities in the whole
range [0, 1] while it is not possible to use the L1-SVM decision function to estimate
conditional probabilities in any interval. However, the L1-SVM is better if one
considers the asymptotic fraction of support vectors. Under some conditions on
the kernel and the regularization sequence, Steinwart proved that the fraction
is $E_x[2\min(\eta(x), 1 - \eta(x))]$, which also happens to be the optimal $\phi$-risk for
the hinge loss function. For L2-SVM, he showed that the asymptotic fraction is
$P_x(\{x \in X : 0 < \eta(x) < 1\})$, which is the probability of the set where noise
occurs. Observe that we can write the fraction of support vectors as $E_x[G(\eta(x))]$
where $G(\eta) = 2\min\{\eta, 1 - \eta\}$ for the hinge loss and $G(\eta) = I(\eta \notin \{0, 1\})$ for the
squared hinge loss. We will see below that these two are extreme cases. In general,
there are loss functions which allow one to estimate probabilities in an interval
centered at 1/2 and for which $G(\eta) = 1$ only on that interval.

Steinwart [7] also derived a general lower bound on the asymptotic number
of support vectors in terms of the probability of the set

$$S = \{(x, y) \in X_{\mathrm{cont}} \times Y : 0 \notin \partial\phi(y F^*_\phi(\eta(x)))\} .$$

Here, $X_{\mathrm{cont}} = \{x \in X : P_x(\{x\}) = 0\}$ and $\partial\phi$ denotes the subdifferential of $\phi$.
In the simple case of a function of one variable $\partial\phi(x) = [\phi'_-(x), \phi'_+(x)]$, where
$\phi'_-$ and $\phi'_+$ are the left and right hand derivatives of $\phi$ (which always exist for
convex functions). If $X_{\mathrm{cont}} = X$, one can write $P(S)$ as

$$P(S) = E_P\left[I(0 \notin \partial\phi(y F^*_\phi(\eta(x))))\right]$$
$$= E_x\left[\eta(x) I(0 \notin \partial\phi(F^*_\phi(\eta(x)))) + (1 - \eta(x)) I(0 \notin \partial\phi(-F^*_\phi(\eta(x))))\right]$$
$$= E_x G(\eta(x)) .$$

For the last step, we simply defined

$$G(\eta) = \eta\, I(0 \notin \partial\phi(F^*_\phi(\eta))) + (1 - \eta)\, I(0 \notin \partial\phi(-F^*_\phi(\eta))) . \tag{4}$$

3 Preliminary Results 

We will consider only classification calibrated convex loss functions. Since $\phi$ is
classification calibrated we know that $\phi'(0) < 0$. Define $t_0$ as

$$t_0 = \inf\{t : 0 \in \partial\phi(t)\}$$

with the convention that $\inf \emptyset = \infty$. Because $\phi'(0) < 0$ and subdifferentials of a
convex function are monotonically non-decreasing, we must have $t_0 > 0$. However, it
may be that $t_0 = \infty$. The following lemma says that sparse solutions cannot be
expected if that is the case.

Lemma 1. If $t_0 = \infty$, then $G(\eta) = 1$ on [0, 1].

Proof. $t_0 = \infty$ implies that for all $t$, $0 \notin \partial\phi(t)$. Using (4), we get
$G(\eta) = \eta \cdot 1 + (1 - \eta) \cdot 1 = 1$. □

Therefore, let us assume that $t_0 < \infty$. The next lemma tells us about the signs
of $\phi'_-(t_0)$ and $\phi'_+(t_0)$.

Lemma 2. If $t_0 < \infty$, then $\phi'_-(t_0) \le 0$ and $\phi'_+(t_0) \ge 0$.

Proof. Suppose $\phi'_-(t_0) > 0$. This implies $\partial\phi(t_0) > 0$. Since the subdifferential is a
monotone operator, we have $\partial\phi(t) > 0$ for all $t > t_0$. By definition of $t_0$, $0 \notin \partial\phi(t)$
for $t < t_0$. Thus, $\{t : 0 \in \partial\phi(t)\} = \emptyset$, which contradicts the fact that $t_0 < \infty$.
Now, suppose that $\phi'_+(t_0) = -\epsilon$ such that $\epsilon > 0$. Since $\lim_{t' \downarrow t_0} \phi'_-(t') = \phi'_+(t_0)$
(see [6], Theorem 24.1), we can find a $t' > t_0$ sufficiently close to $t_0$ such that
$\phi'_-(t') \le -\epsilon/2$. Therefore, by monotonicity of the subdifferential, $\partial\phi(t) < 0$ for
all $t < t'$. This implies $t' \le \inf\{t : 0 \in \partial\phi(t)\}$, which is a contradiction since
$t' > t_0$. □

The following lemma describes the function $F^*_\phi(\eta)$ near 0 and 1. Note that we
have $\phi'_-(-t_0) \le \phi'_+(-t_0) \le \phi'(0) < 0$. Also $\phi'(0) \le \phi'_-(t_0) \le 0$.

Lemma 3. $t_0 \in F^*_\phi(\eta)$ iff $\eta \in [1 - \gamma, 1]$, where $\gamma$ is defined as

$$\gamma = \frac{\phi'_-(t_0)}{\phi'_-(t_0) + \phi'_+(-t_0)} .$$

Moreover, $F^*_\phi(\eta)$ is the singleton set $\{t_0\}$ for $\eta \in (1 - \gamma, 1)$.

Proof. $t_0 \in F^*_\phi(\eta)$ $\Leftrightarrow$ $t_0$ minimizes $C(\eta, \cdot)$ $\Leftrightarrow$ $0 \in \partial_2 C(\eta, t_0)$, where $\partial_2$ denotes
that the subdifferential is with respect to the second variable. This is because
$C(\eta, \cdot)$, being a nonnegative linear combination of convex functions, is convex. Thus, a
necessary and sufficient condition for a point to be a minimum is that the subdifferential
there should contain zero. Now, using the linearity of the subdifferential
operator and the chain rule, we get

$$\partial_2 C(\eta, t_0) = \eta\, \partial\phi(t_0) - (1 - \eta)\, \partial\phi(-t_0)$$
$$= \left[\eta\phi'_-(t_0) - (1 - \eta)\phi'_+(-t_0),\ \eta\phi'_+(t_0) - (1 - \eta)\phi'_-(-t_0)\right] .$$

Hence, $0 \in \partial_2 C(\eta, t_0)$ iff the following two conditions hold.

$$\eta\phi'_-(t_0) - (1 - \eta)\phi'_+(-t_0) \le 0 \tag{5}$$

$$\eta\phi'_+(t_0) - (1 - \eta)\phi'_-(-t_0) \ge 0 \tag{6}$$

The inequality (6) holds for all $\eta \in [0, 1]$ since $\phi'_+(t_0) \ge 0$ and $\phi'_-(-t_0) < 0$. The
other inequality is equivalent to

$$\eta \ge \frac{-\phi'_+(-t_0)}{-\phi'_-(t_0) - \phi'_+(-t_0)} = 1 - \gamma .$$

Moreover, the inequalities are strict when $\eta \in (1 - \gamma, 1)$. Therefore, $t_0$ is the
unique minimizer of $C(\eta, \cdot)$ for these values of $\eta$. □

Corollary 4. $-t_0 \in F^*_\phi(\eta)$ iff $\eta \in [0, \gamma]$. Moreover, $F^*_\phi(\eta)$ is the singleton set
$\{-t_0\}$ for $\eta \in (0, \gamma)$.

Proof. Straightforward once we observe that $F^*_\phi(1 - \eta) = -F^*_\phi(\eta)$. □

The next lemma states that if $F^*_\phi(\eta_1)$ and $F^*_\phi(\eta_2)$ intersect for $\eta_1 \ne \eta_2$ then $\phi$
must have points of non-differentiability. This means that differentiability of the
loss function ensures that one can uniquely identify $\eta$ via any element in $F^*_\phi(\eta)$.



Lemma 5. Suppose $\eta_1 \ne \eta_2$ and $\eta_1, \eta_2 \in (\gamma, 1 - \gamma)$. Then $F^*_\phi(\eta_1) \cap F^*_\phi(\eta_2) \ne \emptyset$
implies that

- $F^*_\phi(\eta_1) \cap F^*_\phi(\eta_2)$ is a singleton set ($= \{t\}$ say).

- $\phi$ is not differentiable at one of the points $t$, $-t$.

Proof. Without loss of generality assume $\eta_1 > \eta_2$. Suppose $t > t'$ and
$t, t' \in F^*_\phi(\eta_1) \cap F^*_\phi(\eta_2)$. This contradicts the fact that $F^*_\phi$ is monotonic since $t' \in F^*_\phi(\eta_1)$,
$t \in F^*_\phi(\eta_2)$ and $t' < t$. This establishes the first claim. To prove the second
claim, suppose $F^*_\phi(\eta_1) \cap F^*_\phi(\eta_2) = \{t\}$ and assume, for sake of contradiction,
that $\phi$ is differentiable at $t$ and $-t$. Since $\eta_1, \eta_2 \in (\gamma, 1 - \gamma)$, Lemma 3 and
Corollary 4 imply that $t \ne \pm t_0$. Therefore, $t \in (-t_0, t_0)$ and $\phi'(t), \phi'(-t) < 0$.

Also, $t \in F^*_\phi(\eta_1) \cap F^*_\phi(\eta_2)$ implies that

$$\eta_1 \phi'(t) - (1 - \eta_1)\phi'(-t) = 0$$
$$\eta_2 \phi'(t) - (1 - \eta_2)\phi'(-t) = 0 .$$

Subtracting and rearranging, we get

$$(\phi'(t) + \phi'(-t))(\eta_1 - \eta_2) = 0$$

which is absurd since $\eta_1 > \eta_2$ and $\phi'(t), \phi'(-t) < 0$. □



Theorem 6. Let $\phi$ be a classification calibrated convex loss function such that
$t_0 = \inf\{t : 0 \in \partial\phi(t)\} < \infty$. Then, for $G(\eta)$ as defined in (4), we have

$$G(\eta) = \begin{cases} 1 & \eta \in (\gamma, 1 - \gamma) \\ \min\{\eta, 1 - \eta\} & \eta \in [0, \gamma] \cup [1 - \gamma, 1] \end{cases} \tag{7}$$

where $\gamma = \phi'_-(t_0) / (\phi'_-(t_0) + \phi'_+(-t_0))$.



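Proof. Using Lemmas 2 and 3, we have $0 \in \partial\phi(F^*_\phi(\eta))$ for $\eta \in [1 - \gamma, 1]$. If
$\eta < 1 - \gamma$, Lemma 3 tells us that $t_0 \notin F^*_\phi(\eta)$. Since $F^*_\phi$ is monotonic, $F^*_\phi(\eta) < t_0$.
Since $t_0 = \inf\{t : 0 \in \partial\phi(t)\}$, $0 \notin \partial\phi(F^*_\phi(\eta))$ for $\eta \in [0, 1 - \gamma)$. Thus, we can write
$I(0 \notin \partial\phi(F^*_\phi(\eta))) = I(\eta \notin [1 - \gamma, 1])$. Also $I(0 \notin \partial\phi(-F^*_\phi(\eta))) = I(0 \notin \partial\phi(F^*_\phi(1 - \eta)))$.
Plugging this in (4), we get

$$G(\eta) = \eta\, I(\eta \notin [1 - \gamma, 1]) + (1 - \eta)\, I(1 - \eta \notin [1 - \gamma, 1])$$
$$= \eta\, I(\eta \notin [1 - \gamma, 1]) + (1 - \eta)\, I(\eta \notin [0, \gamma]) .$$

Since $\gamma \le 1/2$, we can write $G(\eta)$ in the form given above. □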
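
For losses of the form $\phi(t) = h((t_0 - t)_+)$ mentioned in the introduction, the one-sided derivatives in the definition of $\gamma$ reduce to $\phi'_-(t_0) = -h'(0)$ and $\phi'_+(-t_0) = -h'(2t_0)$, so $\gamma$ and the piecewise form (7) can be computed directly. The sketch below relies on that reduction; the function names are illustrative, and the mixed loss with weights 1/3 and 2/3 is taken as the example whose estimation interval is (1/4, 3/4).

```python
def gamma_from_h(h_prime, t0: float) -> float:
    """gamma = phi'_-(t0) / (phi'_-(t0) + phi'_+(-t0)) for phi(t) = h((t0 - t)_+)."""
    left_at_t0 = -h_prime(0.0)              # phi'_-(t0)
    right_at_minus_t0 = -h_prime(2.0 * t0)  # phi'_+(-t0)
    return left_at_t0 / (left_at_t0 + right_at_minus_t0)


def G(eta: float, gamma: float) -> float:
    """Piecewise form (7): 1 inside (gamma, 1 - gamma), min(eta, 1 - eta) outside."""
    if gamma < eta < 1.0 - gamma:
        return 1.0
    return min(eta, 1.0 - eta)


gamma_hinge = gamma_from_h(lambda s: 1.0, 1.0)                     # h(s) = s
gamma_squared = gamma_from_h(lambda s: 2.0 * s, 1.0)               # h(s) = s^2
gamma_mixed = gamma_from_h(lambda s: (2.0 * s + 2.0) / 3.0, 1.0)   # h(s) = s^2/3 + 2s/3
```

The hinge and squared hinge losses give the two extremes, 1/2 and 0, and the mixed loss gives 1/4, so that `G(0.5, gamma_mixed)` is 1 while `G(0.1, gamma_mixed)` is 0.1.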
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Corollary 7. If rji G [0, 1] is such that F^{r]i) fl F^{r]) = 0 for r] ^ rji, then 
G{rf) = l on [min{?7i, 1 - 771}, max{r7i, 1 - 771}]. 

Proof. Lemma 3 and Corollary 4 tell us that rji G (7, 1 — 7). Rest follows from 
Theorem 6. □ 

The preceding theorem and corollary have important implications. First, we can
hope to have sparseness only for values of $\eta \in [0,\gamma] \cup [1-\gamma, 1]$. Second, we
cannot estimate conditional probabilities in these two intervals because $F^*_\phi(\cdot)$ is
not invertible there. Third, any loss function for which $F^*_\phi(\cdot)$ is invertible, say at
$\eta_1 < 1/2$, will necessarily not have sparseness on the interval $[\eta_1, 1-\eta_1]$.

Note that for the case of L1 and L2-SVM, $\gamma$ is 1/2 and 0 respectively. For
these two classifiers, the lower bounds $E_x G(\eta(x))$ obtained after plugging in $\gamma$
in (7) are the ones proved initially [7]. For the L1-SVM, the bound was later
significantly improved [8]. This suggests that $E_x G(\eta(x))$ might be a loose lower
bound in general. In the next section we will show, by deriving sharp improved
bounds, that the bound is indeed loose for a family of loss functions.
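As a small illustration of the remark above, the following sketch evaluates $\gamma$ and the lower bound $G(\eta)$ of (7) for the L1-SVM and L2-SVM. The function names are hypothetical, and the one-sided derivatives are worked out by hand under the assumption that the hinge loss is $(1-t)_+$ and the squared hinge loss is $((1-t)_+)^2$, both with $t_0 = 1$.

```python
def gamma_from_derivatives(left_deriv_at_t0, right_deriv_at_minus_t0):
    """gamma = phi'_-(t0) / (phi'_-(t0) + phi'_+(-t0)), as in Theorem 6."""
    return left_deriv_at_t0 / (left_deriv_at_t0 + right_deriv_at_minus_t0)


def lower_bound_G(eta, gamma):
    """G(eta) from equation (7)."""
    if gamma < eta < 1.0 - gamma:
        return 1.0
    return min(eta, 1.0 - eta)


# Assumed hinge loss (1 - t)_+ with t0 = 1: phi'_-(1) = -1 and phi'_+(-1) = -1.
gamma_l1 = gamma_from_derivatives(-1.0, -1.0)

# Assumed squared hinge loss ((1 - t)_+)^2 with t0 = 1: phi'_-(1) = 0 and phi'_+(-1) = -4.
gamma_l2 = gamma_from_derivatives(0.0, -4.0)
```

With these values the interval $(\gamma, 1-\gamma)$ is empty for the L1-SVM, so the bound reduces to $\min\{\eta, 1-\eta\}$, while for the L2-SVM the bound equals 1 on the whole open interval $(0,1)$.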



4 Improved Bounds 

We will consider convex loss functions of the form

$$\phi(t) = h((t_0 - t)_+) \qquad (8)$$

The function $h$ is assumed to be continuously differentiable and convex. We
also assume $h'(0) > 0$. The convexity of $\phi$ requires that $h'(0)$ be non-negative.
Since we are not interested in everywhere differentiable loss functions we want
a strict inequality. In other words the loss function is constant for all $t \ge t_0$
and is continuously differentiable before that. Further, the only discontinuity in
the derivative is at $t_0$. Without loss of generality, we may assume that $h(0) = 0$
because the solutions to (1) do not change if we add or subtract a constant from
$\phi$. Note that we obtain the hinge loss if we set $h(t) = t$. We now derive the dual
of (1) for our choice of the loss function.
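The family (8) can be sketched in a few lines. The helper name `make_loss` is hypothetical, the choice $t_0 = 1$ for the hinge loss is the usual convention rather than something fixed by the text, and the second function $h(s) = s + s^2/2$ is only an illustrative member of the family (convex, continuously differentiable, $h(0) = 0$, $h'(0) = 1 > 0$).

```python
def make_loss(h, t0):
    """Return phi(t) = h((t0 - t)_+) for a given h and threshold t0, as in (8)."""
    def phi(t):
        return h(max(t0 - t, 0.0))
    return phi


# Hinge loss: h(s) = s, with the conventional (assumed) threshold t0 = 1.
hinge = make_loss(lambda s: s, t0=1.0)

# An illustrative smooth member of the family: h(s) = s + s^2 / 2.
smooth = make_loss(lambda s: s + 0.5 * s * s, t0=1.0)
```

Both losses are identically zero for $t \ge t_0$ and have a kink only at $t_0$, which is the structure the section relies on.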



4.1 Dual Formulation 

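For a convex loss function $\phi(t) = h((t_0 - t)_+)$, consider the optimization problem:

$$\arg\min_w \; \lambda\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \phi(y_i w^T x_i) . \qquad (9)$$

Make the substitution $\xi_i = t_0 - y_i w^T x_i$ to get

$$\arg\min_w \; \lambda\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \phi(t_0 - \xi_i) \qquad (10)$$

$$\text{subject to } \xi_i = t_0 - y_i w^T x_i \text{ for all } i . \qquad (11)$$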
Introducing Lagrange multipliers, we get the Lagrangian:

$$L(w, \xi, \alpha) = \lambda\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \phi(t_0 - \xi_i) + \sum_{i=1}^n \alpha_i (t_0 - y_i w^T x_i - \xi_i) .$$



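Minimizing this with respect to the primal variables $w$ and $\xi_i$'s, gives us

$$w = \frac{1}{2\lambda}\sum_{i=1}^n \alpha_i y_i x_i \qquad (12)$$

$$\alpha_i \in -\partial\phi(t_0 - \xi_i)/n . \qquad (13)$$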

For the specific form of $\phi$ that we are working with, we have

$$-\partial\phi(t_0 - \xi_i)/n = \begin{cases} \{h'(\xi_i)/n\} & \xi_i > 0 \\ [0, h'(0)/n] & \xi_i = 0 \\ \{0\} & \xi_i < 0 . \end{cases} \qquad (14)$$
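The three cases in (14) translate directly into a small helper. This is only a sketch: the function name `alpha_interval` and the representation of each set as a closed interval `(low, high)` are choices made here for illustration.

```python
def alpha_interval(xi, h_prime, n):
    """Closed interval (low, high) describing the set -d phi(t0 - xi) / n from (14)."""
    if xi > 0:
        a = h_prime(xi) / n
        return (a, a)                    # singleton {h'(xi) / n}
    if xi == 0:
        return (0.0, h_prime(0.0) / n)   # full interval [0, h'(0) / n]
    return (0.0, 0.0)                    # singleton {0}
```

For the hinge loss ($h'(s) = 1$) this gives the familiar picture: multipliers are zero for $\xi_i < 0$, equal to $1/n$ for $\xi_i > 0$, and free in $[0, 1/n]$ for $\xi_i = 0$.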

Let $(w^*, \xi^*)$ be a solution of (10). Then we have

$$\lambda\|w^*\|^2 = \frac{1}{2}\sum_{i=1}^n \alpha_i^* (t_0 - \xi_i^*) . \qquad (15)$$



4.2 Asymptotic Fraction of Support Vectors 

Recall that a kernel is called universal if its RKHS is dense in the space of
continuous functions over $X$. Suppose the kernel $k$ is universal and analytic. This
ensures that any function in the RKHS $H$ of $k$ is analytic. Following Steinwart [8],
we call a probability distribution $P$ non-trivial (with respect to $\phi$) if

$$R_{\phi,P} < \inf_{b \in \mathbb{R}} R_{\phi,P}(b) .$$

We also define the $P$-version of the optimization problem (1):

$$f_{P,\lambda} = \arg\min_{f \in H} \; \lambda\|f\|_H^2 + E_P\,\phi(y f(x)) .$$

Further, suppose that $K = \sup\{\sqrt{k(x,x)} : x \in X\}$ is finite. Fix a loss function
of the form (8). Define $G(\eta)$ as

$$G(\eta) = \begin{cases} \eta/\gamma & 0 \le \eta \le \gamma \\ 1 & \gamma < \eta < 1-\gamma \\ (1-\eta)/\gamma & 1-\gamma \le \eta \le 1 \end{cases}$$

where $\gamma = h'(0)/(h'(0) + h'(2t_0))$. Since $\phi$ is differentiable on $(-t_0, t_0)$, Lemma 5
implies that $F^*_\phi$ is invertible on $(\gamma, 1-\gamma)$. Thus, one can estimate conditional
probabilities in the interval $(\gamma, 1-\gamma)$. Let $\#SV(f_{T,\lambda})$ denote the number of
support vectors in the solution (2):

$$\#SV(f_{T,\lambda}) = |\{i : \alpha_i^* \neq 0\}| .$$
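The quantities $\gamma$ and $G(\eta)$ just defined are easy to evaluate for a concrete $h$. The sketch below uses hypothetical function names and takes the derivative $h'$ as a user-supplied callable; the two examples (hinge with $t_0 = 1$, and $h(s) = s + s^2/2$ with $t_0 = 1$) are illustrative assumptions rather than cases singled out in the text.

```python
def gamma_from_h(h_prime, t0):
    """gamma = h'(0) / (h'(0) + h'(2 t0)) for a loss of the form (8)."""
    return h_prime(0.0) / (h_prime(0.0) + h_prime(2.0 * t0))


def improved_G(eta, gamma):
    """Piecewise definition of G(eta) used in the improved bound."""
    if eta <= gamma:
        return eta / gamma
    if eta < 1.0 - gamma:
        return 1.0
    return (1.0 - eta) / gamma


gamma_hinge = gamma_from_h(lambda s: 1.0, 1.0)        # hinge: h'(s) = 1
gamma_smooth = gamma_from_h(lambda s: 1.0 + s, 1.0)   # h(s) = s + s^2 / 2
```

For the hinge loss this gives $\gamma = 1/2$ and $G(\eta) = 2\min\{\eta, 1-\eta\}$, which is twice the value $\min\{\eta, 1-\eta\}$ given by (7).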

The next theorem says that the fraction of support vectors converges to the
expectation $E_x G(\eta(x))$ in probability.

Theorem 8. Let $H$ be the RKHS of an analytic and universal kernel on $\mathbb{R}^d$.
Further, let $X \subset \mathbb{R}^d$ be a closed ball and $P$ be a probability measure on $X \times \{\pm 1\}$
such that $P_x$ has a density with respect to the Lebesgue measure on $X$ and $P$ is
non-trivial. Suppose $\sup\{\sqrt{k(x,x)} : x \in X\} < \infty$. Then for a classifier based
on (1), which uses a loss function of the form (8), and a regularization sequence
which tends to 0 sufficiently slowly, we have

$$\frac{\#SV(f_{T,\lambda_n})}{n} \to E_x G(\eta(x))$$

in probability.

Proof. Let us fix an $\epsilon > 0$. The proof will proceed in four steps of which the last
two simply involve relating empirical averages to expectations.

Step 1. In this step we show that $f_{P,\lambda_n}(x)$ is not too close to $\pm t_0$ for most
values of $x$. We also ensure that $f_{T,\lambda_n}(x)$ is sufficiently close to $f_{P,\lambda_n}(x)$ provided
$\lambda_n \to 0$ slowly. Since $f_{P,\lambda}$ is an analytic function, for any constant $c$, we have

$$P_x(\{x \in X : f_{P,\lambda}(x) = c\}) > 0 \;\Rightarrow\; f_{P,\lambda}(x) = c \quad P_x\text{-a.s.} \qquad (16)$$

Assume that $P_x(\{x \in X : f_{P,\lambda}(x) = t_0\}) > 0$. By (16), we get $P_x(\{x \in X : f_{P,\lambda}(x) = t_0\}) = 1$.
But for small enough $\lambda$, $f_{P,\lambda} \neq t_0$ since $R_{\phi,P}(f_{P,\lambda}) \to R_{\phi,P}$
and $R_{\phi,P}(t_0) \neq R_{\phi,P}$ by the non-triviality of $P$. Therefore, assume that for all
sufficiently large $n$, we have

$$P_x(\{x \in X : f_{P,\lambda_n}(x) = t_0\}) = 0 .$$

Repeating the reasoning for $-t_0$ gives us

$$P_x(\{x \in X : |f_{P,\lambda_n}(x) - t_0| \le \delta\}) \downarrow 0 \text{ as } \delta \downarrow 0$$
$$P_x(\{x \in X : |f_{P,\lambda_n}(x) + t_0| \le \delta\}) \downarrow 0 \text{ as } \delta \downarrow 0 .$$

Define the set $A_\delta(\lambda) = \{x \in X : |f_{P,\lambda}(x) - t_0| \le \delta \text{ or } |f_{P,\lambda}(x) + t_0| \le \delta\}$. For
small enough $\lambda$ and for all $\epsilon > 0$, there exists $\delta > 0$ such that $P_x(A_\delta(\lambda)) \le \epsilon$.
Therefore, we can define

$$\delta(\lambda) = \frac{1}{2}\sup\{\delta > 0 : P_x(A_\delta(\lambda)) \le \epsilon\} .$$
Let $m(\lambda) = \inf\{\delta(\lambda') : \lambda' \ge \lambda\}$ be a decreasing version of $\delta(\lambda)$. Using Proposition
33 from [7] with $\epsilon = m(\lambda_n)$, we conclude that for a sequence $\lambda_n \to 0$ sufficiently
slowly, the probability of a training set $T$ such that

$$\|f_{T,\lambda_n} - f_{P,\lambda_n}\| < m(\lambda_n)/K \qquad (17)$$

converges to 1 as $n \to \infty$. It is important to note that we can draw this conclusion
because $m(\lambda) > 0$ for $\lambda > 0$ (see proof of Theorem 3.5 in [8]). We now relate
the 2-norm of an $f$ to its $\infty$-norm.

$$f(x) = \langle k(x,\cdot), f(\cdot)\rangle \le \|k(x,\cdot)\|\,\|f\|$$
$$\phantom{f(x)} = \sqrt{\langle k(x,\cdot), k(x,\cdot)\rangle}\,\|f\| \qquad (18)$$
$$\phantom{f(x)} = \sqrt{k(x,x)}\,\|f\| \le K\|f\|$$

Thus, (17) gives us

$$\|f_{T,\lambda_n} - f_{P,\lambda_n}\|_\infty < m(\lambda_n) . \qquad (19)$$

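Step 2. In the second step, we relate the fraction of support vectors to an
empirical average. Suppose that, in addition to (19), our training set $T$ satisfies

$$\lambda_n\|f_{T,\lambda_n}\|^2 + R_{\phi,P}(f_{T,\lambda_n}) \le R_{\phi,P} + \epsilon \qquad (20)$$

$$|\{i : x_i \in A_{\delta(\lambda_n)}\}| \le 2\epsilon n . \qquad (21)$$

The probability of such a $T$ also converges to 1. For (20), see the proof of
Theorem III.6 in [9]. Since $P_x(A_{\delta(\lambda_n)}) \le \epsilon$, (21) follows from Hoeffding's
inequality. By definition of $R_{\phi,P}$, we have $R_{\phi,P} \le R_{\phi,P}(f_{T,\lambda_n})$. Thus, (20) gives
us $\lambda_n\|f_{T,\lambda_n}\|^2 \le \epsilon$. Now we use (15) to get

$$\left| \sum_{i=1}^n \alpha_i^* t_0 - \sum_{i=1}^n \alpha_i^* \xi_i^* \right| \le 2\epsilon . \qquad (22)$$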



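Define three disjoint sets: $A = \{i : \xi_i^* < 0\}$, $B = \{i : \xi_i^* = 0\}$ and $C = \{i : \xi_i^* > 0\}$.
We now show that $B$ contains few elements. If $x_i$ is such that $i \in B$ then
$\xi_i^* = 0$ and we have $y_i f_{T,\lambda_n}(x_i) = t_0 \Rightarrow f_{T,\lambda_n}(x_i) = \pm t_0$. On the other hand, if
$x_i \notin A_{\delta(\lambda_n)}$ then $\min\{|f_{P,\lambda_n}(x_i) - t_0|, |f_{P,\lambda_n}(x_i) + t_0|\} > \delta(\lambda_n) \ge m(\lambda_n)$, and
hence, by (19), $f_{T,\lambda_n}(x_i) \neq \pm t_0$. Thus we can have at most $2\epsilon n$ elements in the
set $B$ by (21). Equation (14) gives us a bound on $\alpha_i^*$ for $i \in B$ and therefore

$$\left| \sum_{i \in B} \alpha_i^* t_0 \right| \le 2\epsilon n \times h'(0)t_0/n = 2h'(0)t_0\epsilon . \qquad (23)$$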



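Using (14), we get $\alpha_i^* = 0$ for $i \in A$. By definition of $B$, $\xi_i^* = 0$ for $i \in B$.
Therefore, (22) and (23) give us

$$\left| \sum_{i \in C} \alpha_i^* t_0 - \sum_{i \in C} \alpha_i^* \xi_i^* \right| \le 2(1 + h'(0)t_0)\epsilon = c_1\epsilon$$

where $c_1 = 2(1 + h'(0)t_0)$ is just a constant. We use (14) once again to write $\alpha_i^*$
as $h'(\xi_i^*)/n$ for $i \in C$:

$$\left| \frac{1}{n}\sum_{i \in C} h'(\xi_i^*)\,t_0 - \frac{1}{n}\sum_{i \in C} h'(\xi_i^*)\,\xi_i^* \right| \le c_1\epsilon . \qquad (24)$$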



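Denote the cardinality of the sets $B$ and $C$ by $N_B$ and $N_C$ respectively. Then
we have $N_C \le \#SV(f_{T,\lambda_n}) \le N_C + N_B$. But we showed that $N_B \le 2\epsilon n$ and
therefore

$$\frac{N_C}{n} \le \frac{\#SV(f_{T,\lambda_n})}{n} \le \frac{N_C}{n} + 2\epsilon . \qquad (25)$$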



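Observe that $(\xi_i^*)_+ = 0$ for $i \in A \cup B$ and $(\xi_i^*)_+ = \xi_i^*$ for $i \in C$. Thus, we can
extend the sums in (24) to the whole training set.

$$\left| \frac{1}{n}\sum_{i=1}^n h'((\xi_i^*)_+)\,t_0 - \frac{(n - N_C)\,h'(0)\,t_0}{n} - \frac{1}{n}\sum_{i=1}^n h'((\xi_i^*)_+)(\xi_i^*)_+ \right| \le c_1\epsilon$$

Now let $c_2 = c_1/(h'(0)t_0)$ and rearrange the above sum to get

$$\left| \frac{N_C}{n} - \frac{1}{n}\sum_{i=1}^n \left( 1 - \frac{h'((\xi_i^*)_+)\,t_0 - h'((\xi_i^*)_+)(\xi_i^*)_+}{h'(0)\,t_0} \right) \right| \le c_2\epsilon . \qquad (26)$$

Define $g(t)$ as

$$g(t) = 1 - \frac{h'((t_0 - t)_+)\,t_0 - h'((t_0 - t)_+)(t_0 - t)_+}{h'(0)\,t_0} .$$

Now (26) can be written as

$$\left| \frac{N_C}{n} - E_T\,g(y f_{T,\lambda_n}(x)) \right| \le c_2\epsilon . \qquad (27)$$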



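Step 3. We will now show that the empirical average of $g(y f_{T,\lambda_n}(x))$ is close
to its expectation. We can bound the norm of $f_{T,\lambda_n}$ as follows. The optimum
value for the objective function in (1) is upper bounded by the value it attains
at $f = 0$. Therefore,

$$\lambda_n\|f_{T,\lambda_n}\|^2 + R_{\phi,T}(f_{T,\lambda_n}) \le \lambda_n \cdot 0^2 + R_{\phi,T}(0) = \phi(0) = h(t_0)$$

which, together with (18), implies that

$$\|f_{T,\lambda_n}\| \le \sqrt{\frac{h(t_0)}{\lambda_n}} \qquad (28)$$

$$\|f_{T,\lambda_n}\|_\infty \le K\sqrt{\frac{h(t_0)}{\lambda_n}} . \qquad (29)$$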
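Let $\mathcal{F}_{\lambda_n}$ be the class of functions with norm bounded by $\sqrt{h(t_0)/\lambda_n}$. The covering
number in 2-norm of the class satisfies (see, for example, Definition 1 and
Corollary 3 in [11]):

$$\mathcal{N}_2(\mathcal{F}_{\lambda_n}, \epsilon, n) \le e^{\frac{K^2 h(t_0)}{\lambda_n \epsilon^2}\log(2n+1)} . \qquad (30)$$

Define $L_g(\lambda_n)$ as

$$L_g(\lambda_n) = \sup\left\{ \frac{|g(t) - g(t')|}{|t - t'|} : t, t' \in \left[ -K\sqrt{\frac{h(t_0)}{\lambda_n}},\; K\sqrt{\frac{h(t_0)}{\lambda_n}} \right],\; t \neq t' \right\} . \qquad (31)$$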



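Let $\mathcal{G}_{\lambda_n} = \{(x,y) \mapsto g(y f(x)) : f \in \mathcal{F}_{\lambda_n}\}$. We can express the covering numbers
of this class in terms of those of $\mathcal{F}_{\lambda_n}$ (see, for example, Lemma 14.13 on p. 206
in [1]):

$$\mathcal{N}_2(\mathcal{G}_{\lambda_n}, \epsilon, n) \le \mathcal{N}_2(\mathcal{F}_{\lambda_n}, \epsilon/L_g(\lambda_n), n) . \qquad (32)$$

Now, using a result of Pollard (see Section II.6 on p. 30 in [5]) and the fact that
1-norm covering numbers are bounded above by 2-norm covering numbers, we
get

$$P^n\left( T \in (X \times Y)^n : \sup_{\tilde{g} \in \mathcal{G}_{\lambda_n}} |E_T\,\tilde{g}(x,y) - E_P\,\tilde{g}(x,y)| > \epsilon \right) \le \cdots \qquad (33)$$

The estimates (30) and (32) imply that if

$$\frac{n\lambda_n^2}{L_g^4(\lambda_n)\log(2n+1)} \to \infty \text{ as } n \to \infty$$

then the probability of a training set which satisfies

$$|E_T\,g(y f_{T,\lambda_n}(x)) - E_P\,g(y f_{T,\lambda_n}(x))| \le \epsilon \qquad (34)$$

tends to 1 as $n \to \infty$.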



Step 4. The last step in the proof is to show that $E_P\,g(y f_{T,\lambda_n}(x))$ is close to
$E_x G(\eta(x))$ for large enough $n$. Write $E_P\,g(y f_{T,\lambda_n}(x))$ as

$$E_P\,g(y f_{T,\lambda_n}(x)) = E_x[\eta(x)g(f_{T,\lambda_n}(x)) + (1 - \eta(x))g(-f_{T,\lambda_n}(x))] .$$

Note that if $t^* \in F^*_\phi(\eta)$ then

$$\eta g(t^*) + (1 - \eta)g(-t^*) = G(\eta) . \qquad (35)$$

This is easily verified for $\eta \in [0,\gamma] \cup [1-\gamma,1]$ since $g(t) = 0$ for $t \ge t_0$ and
$g(-t_0) = 1/\gamma$. For $\eta \in (\gamma, 1-\gamma)$ we have

$$\eta g(t^*) + (1 - \eta)g(-t^*) = 1 - \frac{t^*}{h'(0)t_0}\left( \eta h'(t_0 - t^*) - (1 - \eta)h'(t_0 + t^*) \right) .$$

Since $t^*$ minimizes $\eta h(t_0 - t) + (1 - \eta)h(t_0 + t)$ and $h$ is differentiable, we have
$\eta h'(t_0 - t^*) - (1 - \eta)h'(t_0 + t^*) = 0$. Thus, we have verified (35) for all $\eta \in [0, 1]$.
Define the sets $E_n = \{x \in X : \rho(f_{T,\lambda_n}(x), F^*_\phi(\eta(x))) \ge \epsilon\}$. We have $P_x(E_n) \to 0$
by (3). We now bound the difference between the two quantities of interest.

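$$|E_P\,g(y f_{T,\lambda_n}(x)) - E_x G(\eta(x))|$$
$$= |E_x[\eta(x)g(f_{T,\lambda_n}(x)) + (1 - \eta(x))g(-f_{T,\lambda_n}(x))] - E_x G(\eta(x))|$$
$$\le E_x\,|\eta(x)g(f_{T,\lambda_n}(x)) + (1 - \eta(x))g(-f_{T,\lambda_n}(x)) - G(\eta(x))|$$
$$= I_1 + I_2 \le |I_1| + |I_2| \qquad (36)$$

where the integrals $I_1$ and $I_2$ are

$$I_1 = \int_{E_n} |\eta(x)g(f_{T,\lambda_n}(x)) + (1 - \eta(x))g(-f_{T,\lambda_n}(x)) - G(\eta(x))|\,dP_x \qquad (37)$$

$$I_2 = \int_{X \setminus E_n} |\eta(x)g(f_{T,\lambda_n}(x)) + (1 - \eta(x))g(-f_{T,\lambda_n}(x)) - G(\eta(x))|\,dP_x . \qquad (38)$$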

Using (29) and (31) we bound $|g(\pm f_{T,\lambda_n}(x))|$ by $g(0) + L_g(\lambda_n)K\sqrt{h(t_0)/\lambda_n}$.
Since $g(0) = 1$ and $|G(\eta)| \le 1$, we have

$$|I_1| \le \left( 1 + g(0) + L_g(\lambda_n)K\sqrt{\frac{h(t_0)}{\lambda_n}} \right) P_x(E_n) .$$

If $\lambda_n \to 0$ slowly enough so that $L_g(\lambda_n)P_x(E_n)/\sqrt{\lambda_n} \to 0$, then for large $n$,
$|I_1| \le \epsilon$. To bound $|I_2|$, observe that for $x \in X \setminus E_n$, we can find a $t^* \in F^*_\phi(\eta(x))$,
such that $|f_{T,\lambda_n}(x) - t^*| < \epsilon$. Therefore

$$\eta(x)g(f_{T,\lambda_n}(x)) + (1 - \eta(x))g(-f_{T,\lambda_n}(x)) = \eta(x)g(t^*) + (1 - \eta(x))g(-t^*) + \Delta \qquad (39)$$

where $|\Delta| \le c_3\epsilon$ and the constant $c_3$ does not depend on $\lambda_n$. Using (35), we can
now bound $|I_2|$:

$$|I_2| \le c_3\epsilon(1 - P_x(E_n)) \le c_3\epsilon .$$

We now use (36) to get

$$|E_P\,g(y f_{T,\lambda_n}(x)) - E_x G(\eta(x))| \le (c_3 + 1)\epsilon . \qquad (40)$$

Finally, combining (25), (27), (34) and (40) proves the theorem. □
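The identity (35) used in Step 4 can be checked numerically for a concrete loss. The sketch below is only an illustration under stated assumptions: it takes $h(s) = s + s^2/2$ and $t_0 = 1$ (so $h'(s) = 1 + s$ and $\gamma = 1/4$), and it uses the closed-form minimizer $t^* = 4\eta - 2$ clipped to $[-t_0, t_0]$, which was derived by hand for this particular $h$ from the stationarity condition $\eta h'(t_0 - t^*) = (1-\eta)h'(t_0 + t^*)$. All function names are hypothetical.

```python
T0 = 1.0


def h_prime(s):
    """Derivative of the illustrative choice h(s) = s + s^2 / 2."""
    return 1.0 + s


GAMMA = h_prime(0.0) / (h_prime(0.0) + h_prime(2.0 * T0))


def g(t):
    """g(t) as defined in Step 2 of the proof."""
    s = max(T0 - t, 0.0)
    return 1.0 - (h_prime(s) * T0 - h_prime(s) * s) / (h_prime(0.0) * T0)


def G(eta):
    """Piecewise G(eta) from Section 4.2."""
    if eta <= GAMMA:
        return eta / GAMMA
    if eta < 1.0 - GAMMA:
        return 1.0
    return (1.0 - eta) / GAMMA


def minimizer(eta):
    """A point t* in F*_phi(eta) for this h: 4 * eta - 2, clipped to [-t0, t0]."""
    return max(-T0, min(T0, 4.0 * eta - 2.0))


def identity_gap(eta):
    """Absolute difference between the two sides of (35)."""
    t_star = minimizer(eta)
    return abs(eta * g(t_star) + (1.0 - eta) * g(-t_star) - G(eta))
```

Evaluating `identity_gap` on a grid of $\eta$ values in $[0,1]$ should give values at the level of floating-point rounding, covering both the flat region $(\gamma, 1-\gamma)$ and the two linear regions.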



5 Conclusion 

We saw that the decision functions obtained using minimization of regularized
empirical $\phi$-risk approach $F^*_\phi(\eta(\cdot))$. It is not possible to preserve sparseness on
intervals where $F^*_\phi(\cdot)$ is invertible. For the regions outside that interval, sparseness
is maintained to some extent. For many convex loss functions, the general
lower bounds known previously turned out to be quite loose.

But that leaves open the possibility that the previously known lower bounds
are actually achievable by some loss function lying outside the class of loss functions
we considered. However, we conjecture that it is not possible. Note that the
bound of Theorem 8 only depends on the left derivative of the loss function at
$t_0$ and the right derivative at $-t_0$. The derivatives at other points do not affect
the asymptotic number of support vectors. This suggests that the assumption
of the differentiability of $\phi$ before the point where it attains its minimum can
be relaxed. It may be that results on the continuity of solution sets of convex
optimization problems can be applied here (see, for example, [3]).



Acknowledgements. Thanks to Grace Wahba and Laurent El Ghaoui for help- 
ful discussions. 
Abstract. The Gram matrix plays a central role in many kernel methods.
Knowledge about the distribution of eigenvalues of the Gram matrix
is useful for developing appropriate model selection methods for kernel 
PCA. We use methods adapted from the statistical physics of classical 
fluids in order to study the averaged spectrum of the Gram matrix. We fo- 
cus in particular on a variational mean-field theory and related diagram- 
matic approach. We show that the mean-field theory correctly reproduces 
previously obtained asymptotic results for standard PCA. Comparison 
with simulations for data distributed uniformly on the sphere shows that 
the method provides a good qualitative approximation to the averaged 
spectrum for kernel PCA with a Gaussian Radial Basis Function kernel. 
We also develop an analytical approximation to the spectral density that 
agrees closely with the numerical solution and provides insight into the 
number of samples required to resolve the corresponding process eigen- 
values of a given order. 



1 Introduction 

The application of the techniques of statistical physics to the study of learning
problems has been an active and productive area of research [1]. In this contribution
we use the methods of statistical physics to study the eigenvalue spectrum
of the Gram matrix, which plays an important role in kernel methods such as
Support Vector Machines, Gaussian Processes and kernel Principal Component
Analysis (kernel PCA) [2]. We focus mainly on kernel PCA, in which data is
projected into a high-dimensional (possibly infinite-dimensional) feature space
and PCA is carried out in the feature space. The eigensystem of the sample covariance
of feature vectors can be obtained by a trivial linear transformation of
the Gram matrix eigensystem. Kernel PCA has been shown to be closely related
to a number of clustering and manifold learning algorithms, including spectral
clustering, Laplacian eigenmaps and multi-dimensional scaling (see e.g. [3]).



J. Shawe-Taylor and Y. Singer (Eds.): COLT 2004, LNAI 3120, pp. 579—593, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 
The eigenvalue spectrum of the Gram matrix is of particular importance in
kernel PCA. In order to find a low-dimensional representation of the data, only 
the eigenvectors corresponding to the largest few eigenvalues are used. Model 
selection methods are required in order to determine how many eigenvectors to 
retain. For standard PCA it is instructive to study the eigenvalues of the sample 
covariance for idealised distributions, such as the Gaussian Orthogonal Ensemble 
(GOE), in order to construct appropriate model selection criteria [4]. For kernel 
PCA it would also be instructive to understand how the eigenvalues of the Gram 
matrix behave for idealised distributions, but this is expected to be significantly 
more difficult than for standard PCA. In this paper we present some preliminary 
results from an analysis of Gram matrix eigenvalue spectra using methods from 
statistical mechanics. 

In a recent paper we studied the case of PCA with non-isotropic data and 
kernel PCA with a polynomial kernel function [5]. In the case of a polynomial 
kernel, kernel PCA is equivalent to PCA in a finite-dimensional feature space 
and the analysis can be carried out explicitly in the feature space. We presented 
numerical evidence that an asymptotic theory for standard PCA can be adapted 
to kernel PCA in that case. In contrast, here we consider the more general case 
in which the feature space may be infinite dimensional, as it is for the popular 
Gaussian Radial Basis Function (RBF) kernel. In this case it is more useful to 
carry out the analysis of the Gram matrix directly. We review some different 
approaches that have been developed in the physics literature for the analysis 
of the spectra of matrices formed from the positions of particles randomly dis- 
tributed in a Euclidean space (Euclidean Random Matrices), which are related 
to the instantaneous normal modes of a classical fluid. We focus in particular on 
a variational mean-field approach and a closely related diagrammatic expansion 
approach. The theory is shown to reproduce the correct asymptotic result for 
the special case of standard PCA. For kernel PCA the theory provides a set of 
self-consistent equations and we solve these equations numerically for the case 
of data uniformly distributed on the sphere, which can be considered a simple 
null distribution. We also provide an analytical approximation that is shown to 
agree closely with the numerical results. Our results provide insight into how 
many samples are required to accurately estimate the eigenvalues of the associ- 
ated continuous eigenproblem. We provide simulation evidence showing that the 
theory provides a good qualitative approximation to the average spectrum for a 
range of parameter values. 

The Gram matrix eigenvalue spectrum has previously been studied by Shawe-Taylor
et al., who have derived rigorous bounds on the difference between the
eigenvalues of the Gram matrix and those of the related continuous eigenproblem [6].
The statistical mechanics approach is less rigorous, but provides insight
into regimes where the rigorous bounds are not tight. For example, in the study
of PCA one can take the asymptotic limit of large sample size for fixed data
dimension, and in this regime the bounds developed by Shawe-Taylor et al. can
be expected to become asymptotically tight [7]. However, other asymptotic results
for PCA have been developed in which the ratio of the sample size to data
dimension is held fixed while the sample size is increased (e.g. [8,9]). Our results
reduce to the exact asymptotics of standard PCA in this latter regime, and we
therefore expect that our methods will provide an alternative but complementary
approach to the problem.

The paper is organised as follows. In the next section we introduce some 
results from random matrix theory for the eigenvalues of a sample covariance 
matrix. These results are relevant to the analysis of PCA. We introduce kernel 
PCA and define the class of centred kernels used there. In section 3 we discuss 
different theoretical methods for determining the average Gram matrix spec- 
trum and derive general results from the variational mean-field method and a 
related diagrammatic approach. We then derive an analytical approximation to 
the spectrum that is shown to agree closely with numerical solution of the mean- 
field theory. In section 4 we compare the theoretical results with simulations and 
in section 5 we conclude with a brief summary and discussion. 



2 Background 

2.1 Limiting Eigenvalue Spectrum for PCA 

Consider a data set of $p$-dimensional data vectors $x_i$, $i = 1, \ldots, N$, with mean
$\bar{x}$. A number of results for the asymptotic form of the sample covariance matrix,
$\hat{C} = N^{-1}\sum_i (x_i - \bar{x})(x_i - \bar{x})^T$, have been derived in the limit of large $p$ with the
ratio $\alpha = N/p$ held fixed¹. We will see later that these results are closely related
to our approximate expressions for kernel PCA.

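We denote the eigenvalues of the sample covariance matrix $\hat{C}$ as $\hat\lambda_i$, $i = 1, \ldots, p$.
The eigenvalue density $\rho(\lambda)$ can be written in terms of the trace of the
sample covariance resolvent,

$$R_{\hat{C}}(z) = N^{-1}\,\mathrm{Tr}\,(zI - \hat{C})^{-1} = \frac{1}{N}\sum_{i=1}^{p}\frac{1}{z - \hat\lambda_i} . \qquad (1)$$

The eigenvalue density is obtained from the identity

$$\alpha^{-1}\rho(\lambda) = \lim_{\epsilon \to 0^+}\frac{1}{\pi}\,\mathrm{Im}\,R_{\hat{C}}(\lambda - i\epsilon) . \qquad (2)$$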
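The identity (2) can be illustrated at finite $\epsilon$, where each eigenvalue contributes a Lorentzian of width $\epsilon$ to the density. The sketch below is a minimal illustration with hypothetical function names; it takes a list of sample eigenvalues as given and does not compute them from data.

```python
import math


def resolvent(z, eigenvalues, n_samples):
    """R(z) = (1/N) * sum_i 1 / (z - lambda_i), the right-hand side of (1)."""
    return sum(1.0 / (z - lam) for lam in eigenvalues) / n_samples


def smoothed_density(lam, eigenvalues, n_samples, eps):
    """Finite-eps version of (2): (1/pi) * Im R(lam - i*eps), approximating rho(lam) / alpha."""
    return resolvent(complex(lam, -eps), eigenvalues, n_samples).imag / math.pi
```

With $p$ eigenvalues and $N$ samples, integrating `smoothed_density` over $\lambda$ gives approximately $p/N = \alpha^{-1}$, consistent with $\rho(\lambda)$ being normalised to one.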



This is the starting point for a number of studies in the physics and statistics
literature (e.g. [8,10]) and the function $-R_{\hat{C}}(z)$ is also known as the Stieltjes
transform of the eigenvalue distribution. As $N, p \to \infty$ the density $\rho(\lambda)$ is self-averaging
and approaches a well defined limit. It has been shown, with relatively
weak conditions on the data distribution, that as $N \to \infty$ with $\alpha$ fixed [8,9]




¹ The notation differs from our previous work [5,12] since $N$ is more often used for
the number of data points in the machine learning literature.

where $\lambda_i$ are the eigenvalues of the population covariance, or equivalently they
are the eigenvalues of the sample covariance in the limit of infinite data, i.e.
$\hat\lambda_i \to \lambda_i$ as $\alpha \to \infty$. For non-Gaussian data vectors with i.i.d. components this
result has been shown to hold so long as the second moments of the covariance
exist [9], while other results have been derived with different conditions on the
data (e.g. [8]). An equivalent result has also been derived using the replica trick
from statistical physics [10], although this was limited to Gaussian data.

The solution of the Stieltjes transform relationship in eq.(3) provides insight
into the behaviour of PCA. Given the eigenvalues of the population covariance,
eq.(3) can be solved in order to determine the observed distribution of eigenvalues
for the sample covariance. For finite values of $\alpha$, the observed eigenvalues are
dispersed about the true population eigenvalues and significantly biased. One can
observe phase-transition like behaviour, where the distribution splits into distinct
regions with finite support as the parameter $\alpha$ is increased, corresponding to
signal and noise [5,11]. One can therefore determine how much data is required
in order to successfully identify structure in the data. We have shown that the
asymptotic result can be accurate even when $N \ll p$, which may often be the case
in very high-dimensional data sets [12]. It is also possible to study the overlap
between the eigenvectors of the sample and population covariance within the
statistical mechanics framework (see [13] and other references in [1]) but here
we will limit our attention to the eigenvalue spectrum.

2.2 Kernel PCA 

By construction, PCA only finds features that are linear combinations of the data
vector components. Often one constructs higher dimensional non-linear features
$\chi(x)$ from an input vector $x$ in order to gain improved performance. In this case
PCA can then be performed on the new high-dimensional feature vectors $\chi$.
Given a data set $x_i$, $i = 1, \ldots, N$, the feature space covariance matrix is
$\hat{C}^\chi = N^{-1}\sum_i \chi(x_i)\chi(x_i)^T$. We have initially taken the sample mean vector in feature
space, $\bar\chi$, to be zero. Decomposition of $\hat{C}^\chi$ can be done entirely in terms of the
decomposition of the Gram matrix $K$, with elements $K_{ij} = \chi(x_i) \cdot \chi(x_j)$. The
"kernel trick" tells us that a suitably chosen kernel function $k(x, y)$ represents
the inner product, $\chi(x) \cdot \chi(y)$, in a particular (possibly unknown) feature space.
Thus PCA in the feature space can be performed by specifying only a kernel
function $k(x,y)$, without ever having to determine the corresponding mapping
into feature space. This is kernel PCA [2].

For the kernel to represent an inner product in some feature space it must
be symmetric, i.e. $k(x,y) = k(y,x)$. Popular choices of kernel for fixed length
vectors mostly fall into one of two categories: dot-product kernels $k(x, y) = k(x \cdot y)$
and translationally invariant kernels $k(x,y) = k(\|x - y\|)$. A common
and important choice for $k(t)$ in this latter case is the Gaussian Radial Basis
Function (RBF) kernel $k(t) = \exp(-t^2/2b^2)$. It should be noted that for data
constrained to the surface of a sphere, a kernel of the form $k(x, y) = k(\|x - y\|^2)$
is equivalent to a dot-product kernel. Standard PCA corresponds to a linear dot-product
kernel $k(x, y) = x \cdot y$.
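The remark that a kernel depending on $\|x - y\|^2$ becomes a dot-product kernel on the sphere follows from $\|x - y\|^2 = 2 - 2\,x \cdot y$ for unit vectors. The sketch below checks this numerically for the Gaussian RBF kernel; the function names are hypothetical and the test vectors are arbitrary.

```python
import math


def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rbf_kernel(x, y, b):
    """Gaussian RBF kernel exp(-||x - y||^2 / (2 b^2))."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * b * b))


def rbf_as_dot_product(x, y, b):
    """Same kernel for unit vectors, written as exp(-1/b^2) * exp(x.y / b^2)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return math.exp(-1.0 / (b * b)) * math.exp(dot / (b * b))
```

The two functions agree only when both arguments have unit norm; for general vectors the second form is not a valid rewriting of the first.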
In reality, the kernels above will not result in $\bar\chi = 0$ for most data sets. In
this case the sample covariance in feature space corresponds to
$N^{-1}\sum_i (\chi(x_i) - \bar\chi)(\chi(x_i) - \bar\chi)^T$. The Gram matrix can be centred to produce a new matrix $F$
with elements

$$F_{ij} = K_{ij} - \frac{1}{N}\sum_j K_{ij} - \frac{1}{N}\sum_i K_{ij} + \frac{1}{N^2}\sum_{ij} K_{ij} . \qquad (4)$$

This is equivalent to transforming the feature vectors to have zero mean, but
again $F$ can be calculated with knowledge only of the kernel function $k(x,y)$.
It should be noted that $\sum_i F_{ij} = \sum_j F_{ij} = 0$, and so $F$ always has a zero
eigenvalue with corresponding eigenvector $(1,1,\ldots,1)$. If the pattern vectors $x_i$
are drawn from some distribution $p_x(x)$ then as $N \to \infty$ the matrix elements
$F_{ij}$ can be considered a random sample produced by a kernel function,

$$f(x,y) = k(x,y) - \int d\mu(y)\,k(x,y) - \int d\mu(x)\,k(x,y) + \int d\mu(x)\int d\mu(y)\,k(x,y) , \qquad (5)$$

by sampling $N$ pattern vectors $x_i$ drawn from $p_x(x)$ (where the measure $d\mu(x)$
denotes $dx\,p_x(x)$). Clearly $\int d\mu(x) f(x,y) = \int d\mu(y) f(x,y) = 0$.
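The centring in (4) can be written as a short routine that needs only the Gram matrix. The sketch below uses plain nested lists and a hypothetical function name; it subtracts the row and column means and adds back the grand mean.

```python
def centre_gram(K):
    """Return the centred Gram matrix F of equation (4) for a square matrix K."""
    n = len(K)
    row_mean = [sum(K[i]) / n for i in range(n)]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [K[i][j] - row_mean[i] - col_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]
```

For a linear kernel, the result coincides with the Gram matrix of the mean-subtracted data, and every row and column of the output sums to zero, which is the source of the zero eigenvalue with eigenvector $(1,1,\ldots,1)$.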



3 Statistical Mechanics Theory 

We now use $R_F(z)$ to represent the ensemble averaged trace of the centred Gram
matrix resolvent, i.e.,

$$R_F(z) = N^{-1}\,\overline{\mathrm{Tr}\,(zI - F)^{-1}} = N^{-1}\frac{\partial}{\partial z}\overline{\log\det(zI - F)} , \qquad (6)$$

where the bar denotes an ensemble average over data sets. The expected eigenvalue
density can then be obtained from $R_F(z)$ as in eq.(2). The replica trick
of using the representation $\log x = \lim_{n \to 0} n^{-1}(x^n - 1)$ is used to facilitate the
evaluation of the expectation of the log of the determinant in (6). Following
Mezard et al. [14] we define $\Xi_N = \overline{\det(zI - F)^{-n/2}}$ and then take the replica
limit $n \to 0$. Using a Gaussian integral representation of the square root of the
determinant one finds,

N n / 

i—1 a—1 \ i^a 

(7) 

where $f(x,y)$ is the kernel function. We introduce the field $\psi_a(x) = \sum_{i=1}^{N}\phi_i^a\,\delta(x - x_i)$
and its conjugate $\hat\psi_a(x)$ which serves as a Lagrange multiplier.
Moving to the grand canonical partition function $Z(\omega) = \sum_N \omega^N \Xi_N / N!$,

we find (after some straightforward algebra and Gaussian integrations),

= j Y[viia exp iwA - ^Trlog/-f ^ ^ Jdxdy ■$a{x)f-\x,y)^a{y) 
= f Y[Vi^a exp (^S[ip]j , (8)

a 

where / is considered an operator and we have introduced, 

-SE’SW) . (9) 

o / 

With $\omega\langle A\rangle = \omega\,\partial\log Z(\omega)/\partial\omega = N$ and $\langle A\rangle \to 1$ as $n \to 0$ we take $\omega = N$ in
the limit $n \to 0$. Having obtained a formal expression for the grand canonical
partition function we have a number of avenues available to us,

- Asymptotically expanding $Z(\omega)$ using $\omega$ as either a small or large parameter
(low & high density expansions).

- Variational approximation, such as the Random Phase Approximation
(RPA) used by Mezard et al. [14], to provide a non-asymptotic approximation
to the expected eigenvalue density.

- Formal diagrammatic expansion of $Z(\omega)$ to elucidate the nature of the various
other approximations.

- Density functional approaches, appropriate for approximating the behaviour
of the expected spectra when the input density is inhomogeneous, i.e. not
uniform over the sphere or some other analytically convenient distribution.

In this paper we shall focus on the second and third approaches. The results 
in [14] suggest that the variational approach will reduce to the low and high 
density expansions in the appropriate limits. We leave the fourth approach for 
future study. 

3.1 Variational Approximation 

We now make a variational approximation to the grand canonical free energy,
$F = -\log Z(\omega)$, by introducing the quadratic action,

Srpa = [dxdyiia{x)G~i^{x,y)ipb{y) , (10) 

a,b 

with corresponding partition function, 

ZrpA = f Y\_'^^aexp (^SRPA[lp]j . ( 11 ) 

The grand canonical free energy satisfies the Bogoliubov inequality,

$$F \le F_{var} = \langle S - S_{RPA}\rangle - \log Z_{RPA} , \qquad (12)$$

where the brackets denote an integral with respect to $\psi$. We then proceed by
minimising this upper bound.



A= - 



(?) 



dy,{x) exp 
Define eigenfunctions of the kernel $f$ with respect to the measure $\mu(x)$,

$$\int d\mu(x)\,f(x,y)\,q_m(x) = \lambda_m q_m(y) . \qquad (13)$$

We can represent both $f$ and its inverse $f^{-1}$ in terms of the eigenfunctions
$q_m(x)$,

$$f(x,y) = \sum_m \lambda_m q_m^*(x) q_m(y) , \qquad f^{-1}(x,y) = \sum_m p_x(x)\,q_m^*(x) q_m(y)\,p_x(y)/\lambda_m . \qquad (14)$$

Writing the variational free energy solely in terms of the propagator $G$ (and
dropping irrelevant constant terms) we obtain,

Fvar = fdx fdy f~^{x,y)Gaa{y,x) - ^TrlogG 



27 t \ ^ f 

— j /d/x(x)exp(-lTr„log(J-z“^G(a7,x))) . (15) 

Here Tr represents a trace over both $x$ and replicas, whilst $\mathrm{Tr}_n$ represents a
trace only over replicas. The variational free energy is minimized by setting,



/27t\ ^ 

W f — j i5(a; -y)pa,(a;) exp (-lTr„ log (J-z"^G(a;, a:))) 



X ^ ^ ^G(a7,a?)) ^ 



(16) 



' ba 



Looking for a solution of the form $G_{ab}(x, y) = \delta_{ab}G(x, y)$ and taking $n \to 0$ we
find

$$f^{-1}(x,y) - N\delta(x - y)\,p_x(x)\,\frac{1}{z - G(x,x)} - G^{-1}(x,y) = 0 , \qquad (17)$$

with the resolvent given as,

$$R_F(z) = \int d\mu(x)\,\frac{1}{z - G(x,x)} . \qquad (18)$$

If we write $G^{-1}(x,y) = f^{-1}(x,y) - \delta(x - y)\,p_x(x)h(x)$ then we have,

$$R_F(z) = N^{-1}\int dx\,p_x(x)h(x) \quad \text{with} \quad h(x) = \frac{N}{z - G(x,x)} . \qquad (19)$$

Closed form solution of these self-consistent equations is not possible in general
and they would have to be solved numerically.
A useful approximation is obtained by replacing $G(x,x)$ in $h(x)$ by its average
$\bar{G} = \int dx\,p_x(x)G(x,x)$, from which we find $\bar{G} = \sum_m \lambda_m[1 - N R_F(z)\lambda_m]^{-1}$.

Using this approximation and substituting into $h(x)$ we obtain the Stieltjes
transform relationship in eq.(3) for $R_{\hat{C}^\chi}$, the trace of the resolvent of the feature
space covariance matrix $\hat{C}^\chi$. This result can also be obtained by expanding
$G_{ab}(x,y)$ in terms of the eigenfunctions $q_m(x)$ and assuming replica symmetry,
i.e. writing $G_{ab}(x,y) = \delta_{ab}\sum_m \eta_m q_m(x)q_m(y)$. Minimizing the variational free energy
with respect to the coefficients $\eta_m$ and using Jensen's inequality one again
obtains the relationship in eq.(3). This was the approach taken by Mezard et
al. [14]. They considered the case where the kernel is translationally invariant
and the input density $p_x(x)$ is uniform, and in this case the Stieltjes transform
relationship (3) represents an exact solution to the variational problem. In the
simulation results in section 4 we consider a dot-product kernel with data distributed
uniformly on the sphere, and in this case eq.(3) again represents an
exact solution to the variational problem. However, our derivation above shows
that eq.(3) is actually also an approximate solution to a more general variational
approach. We will see that this more general variational result can also be
derived from the diagrammatic approach in the next section. In section 3.3 we
show how to solve the relationship in eq.(3) for the specific case of data uniformly
distributed on the sphere.



3.2 Diagrammatic Expansion 



The partition function $Z(\omega)$ in eq.(8) can be expanded in powers of $\omega A$. The
exponential form for each occurrence of $A$ (see eq.(9)) can also be expanded
to yield a set of Gaussian integrations. These can be represented in standard
diagrammatic form and so we do not give the full details here but merely quote
the final results. The free energy $\log Z(\omega)$ only contains connected diagrams [15].
Thus we find,

$$\log Z(\omega) = \bullet + \text{[diagram]} + \text{[diagram]} + \cdots$$



A node • represents integration with weight $\omega(2\pi/z)^{n/2}p_x(x)$. The connecting lines
correspond to a propagator $z^{-1}f(x,y)$ and all diagrams have an additional
weight $n$.

From this expansion of $\log Z(\omega)$ a diagrammatic representation of the resolvent
$R_F(z)$ can easily be obtained on making the replacement $\omega = N$. Diagrams
with articulation points can be removed by re-writing $R_F(z) = N^{-1}\int d\mu(x)h(x)$
and $h(x) = N/[z - G(x,x)]$, where $G(x,x)$ is given by
where in all diagrams for $G(x,x)$ a connecting line now represents a propagator
$f(x,y)$ and a node • represents an integration with weight $p_x(x)h(x)$. If we
re-sum only those diagrams in $G(x,x)$ consisting of simple loops we recover the
RPA and the relationships given by eqs. (17) and (19).



3.3 Solution of the Stieltjes Transform Relationship 

From the variational approximation described in section 3.1 we obtained the 
Stieltjes transform relationship given by eq.(3). Solving the relationship requires 
knowledge of the eigenvalues of the centred kernel. In this section we develop an 
approximate analytical solution to eq.(3) and illustrate it for the special case of 
data uniformly distributed on the sphere, which is a useful null distribution for
kernel PCA.

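For illustrative purposes we will restrict ourselves to dot-product kernels, in
which case we have a centred kernel,

$$f(x \cdot y) = k(x \cdot y) - C , \qquad (20)$$

where $C = \frac{\Gamma(p/2)}{\sqrt{\pi}\,\Gamma((p-1)/2)} \int_{-1}^{1} dt\, (1-t^2)^{(p-3)/2}\, k(t)$. The Hecke-Funk theorem tells us that the
eigenfunctions $\phi_n(x)$ of $k(t)$ are Spherical Harmonics [16]. The $n^{th}$ (non-zero)
eigenvalue has degeneracy $N(p, n)$ and is found to be,

$$\lambda_n = \frac{\Gamma(p/2)}{\sqrt{\pi}} \left(\frac{1}{2}\right)^{n} \frac{1}{\Gamma((2n+p-1)/2)} \int_{-1}^{1} dt\, (1-t^2)^{(2n+p-3)/2} \left(\frac{d^n k(t)}{dt^n}\right) , \qquad (21)$$

$$N(p,n) = \frac{2n + p - 2}{n} \binom{n+p-3}{n-1} . \qquad (22)$$

For the Gaussian RBF kernel we have,

$$k(x,y) = \exp\left(-\frac{1}{2b^2}\|x-y\|^2\right) = \exp\left(-\frac{1}{b^2}\right) \exp\left(\frac{x \cdot y}{b^2}\right) , \qquad (23)$$

for $x, y$ on the unit sphere. The centred kernel is easily found to be $f(x,y) =
k(x, y) - C$, $C = \exp(-b^{-2})\,\Gamma(p/2)\,(2b^2)^{p/2-1} I_{p/2-1}(b^{-2})$, where $I_\nu(x)$ is the modified
Bessel function of the first kind, of order $\nu$ [17]. The $n^{th}$ eigenvalue is given by,

$$\lambda_n = \exp(-b^{-2})\,\Gamma(p/2)\,(2b^2)^{p/2-1} I_{n+p/2-1}(b^{-2}) . \qquad (24)$$

For $p = 2$ we have two-fold degenerate eigenvalues $\lambda_n = \exp(-b^{-2}) I_n(b^{-2})$, in
agreement with Twining and Taylor [18].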
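As an illustration of the quantities above, the following minimal sketch evaluates the spherical-harmonic degeneracy $N(p,n) = \frac{2n+p-2}{n}\binom{n+p-3}{n-1}$ and the RBF process eigenvalues $\lambda_n = \exp(-b^{-2})\,\Gamma(p/2)\,(2b^2)^{p/2-1} I_{n+p/2-1}(b^{-2})$, with the Bessel function summed from its power series. The function names, the series truncation and the sanity check are assumptions made for this example, not part of the original analysis. The check uses the fact that $k(x,x) = 1$ on the unit sphere, so the degeneracy-weighted eigenvalues (including the $n = 0$ term, which equals $C$) should sum to one.

```python
import math

def degeneracy(p, n):
    """N(p, n): number of degree-n spherical harmonics on the unit sphere in R^p."""
    if n == 0:
        return 1
    # (2n + p - 2)/n * binom(n + p - 3, n - 1); the product is always divisible by n.
    return (2 * n + p - 2) * math.comb(n + p - 3, n - 1) // n

def bessel_i(nu, a, terms=80):
    """Modified Bessel function I_nu(a) from its power series (a > 0, nu >= 0)."""
    total = 0.0
    for m in range(terms):
        log_term = ((2 * m + nu) * math.log(a / 2.0)
                    - math.lgamma(m + 1) - math.lgamma(m + nu + 1))
        total += math.exp(log_term)
    return total

def rbf_eigenvalue(p, n, b2=1.0):
    """Process eigenvalue lambda_n of the RBF kernel for uniform data on the sphere."""
    a = 1.0 / b2
    log_prefactor = -a + math.lgamma(p / 2.0) + (p / 2.0 - 1.0) * math.log(2.0 * b2)
    return math.exp(log_prefactor) * bessel_i(n + p / 2.0 - 1.0, a)

def weighted_trace(p, b2=1.0, n_max=40):
    """Sum of N(p, n) * lambda_n over n = 0..n_max; should be close to k(x, x) = 1."""
    return sum(degeneracy(p, n) * rbf_eigenvalue(p, n, b2) for n in range(n_max + 1))

if __name__ == "__main__":
    for p in (2, 3, 10):
        print(p, [degeneracy(p, n) for n in range(4)], weighted_trace(p))
```

For $p = 2$ every non-zero degree is two-fold degenerate and for $p = 3$ the degeneracy is $2n + 1$, which the sketch reproduces.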
The density of observed eigenvalues is given by eq.(2) and is obtained from
the imaginary part of the function Rc^{z) which solves eq.(3), 



2 = 



1 



1 N{p,n)Xn 

^ 1 



z 



Re z — ze , 



(25) 



in the limit $\epsilon \to 0^+$. Here $\lambda_n$ and $N(p,n)$ would be given by (24) and (22)
for the RBF kernel. We can solve this equation numerically and results of the
numerical solution for some specific examples are given in section 4. However,
it is also instructive to obtain an analytical approximation for limiting cases
as this provides us with greater insight. The approximate expression derived
here appears to provide a good approximation to the numerical results in many
cases. The expansion is perturbative in nature, and we recover the exact results
for standard PCA in the limit of large N with $\alpha = N/p$ held fixed [8]. However,
our simulations also show that the approximation works well for small values of
p.

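In general as N increases (for fixed p) we would expect to recover the process
eigenvalues $\lambda_n$ from the eigenvalues of the Gram matrix. We will see below that
larger values of N are required in order to resolve the smaller eigenvalues. As
$N \to \infty$ we expect $R_{\hat{C}}(z)$ to become localized around each $\lambda_k^{-1}$. If we put
$R_{\hat{C}}(z) = \lambda_k^{-1}(1 + \delta_k)$ and expand, we have

$$z = \lambda_k(1 - \delta_k) - \frac{\lambda_k N(p,k)}{N \delta_k} + \Sigma_1 + \delta_k \lambda_k^{-1} \Sigma_2 + \ldots , \qquad (26)$$

where $\Sigma_1$ and $\Sigma_2$ are given by,

$$\Sigma_1 = \frac{1}{N} \sum_{n \neq k} \frac{N(p,n)}{\Delta_{nk}} , \qquad \Sigma_2 = \frac{1}{N} \sum_{n \neq k} \frac{N(p,n)}{\Delta_{nk}^2} , \qquad (27)$$

and $\Delta_{nk} = \lambda_n^{-1} - \lambda_k^{-1}$. Dropping the higher order terms and solving for $\delta_k$ gives,

$$\delta_k = \frac{-(\lambda_k + \Sigma_1 - z) \pm \sqrt{(\lambda_k + \Sigma_1 - z)^2 + 4N(p,k)\lambda_k(\lambda_k^{-1}\Sigma_2 - \lambda_k)/N}}{2(\lambda_k^{-1}\Sigma_2 - \lambda_k)} . \qquad (28)$$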

From (28) we can see that provided $\lambda_k^2 > \Sigma_2$, $\delta_k$ will give a contribution to the
Gram matrix eigenvalue density of,



1 \J (A Xjnin,k){Xmax,k ^ 



27T 



Afc — Xf. ^^2 



X t9(A; Xjyiin^k-! Xjyiax.k) 5 (^^) 



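where,

$$\lambda_{\{max,min\},k} = \lambda_k + \Sigma_1 \pm \sqrt{4N(p,k)\lambda_k^2(1 - \lambda_k^{-2}\Sigma_2)/N} = \lambda_k \pm 2\lambda_k \left(\frac{N(p,k)}{N}\right)^{1/2} + O(N^{-1}) , \qquad (30)$$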
and we have defined $\Theta(\lambda; \lambda_{min}, \lambda_{max}) = \Theta(\lambda_{max} - \lambda)\Theta(\lambda - \lambda_{min})$ with $\Theta(x)$
being the usual Heaviside step function. If we take $\lambda_{max,1}$ as our estimate of
the expectation of the largest observed eigenvalue then we can see that the
fractional bias is $2\sqrt{N(p,1)/N} + O(N^{-1})$. Thus for dot-product kernels with an
input distribution that is uniform on the sphere the RPA estimates the fractional
bias in the largest eigenvalue to be $2\sqrt{p/N} + O(N^{-1})$ and so is the same for all
dot-product kernels.

The inequality $\lambda_k^2 > \Sigma_2$ can always be satisfied for sufficiently large N,
and therefore the population eigenvalue can be resolved provided N is large
enough. It is easily verified for the standard PCA case that, taking population
eigenvalues $\lambda_1 = \sigma^2(1 + A)$, $\lambda_n = \sigma^2$, $n = 2, \ldots, p$, the known condition $\alpha =
N/p > A^{-2}$ is recovered [13]. Unsurprisingly, the dispersion of $\rho(\lambda)$ about $\lambda_k$
decreases as $N \to \infty$.
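To make the resolvability condition concrete, here is a minimal sketch that evaluates the sums $\Sigma_1$, $\Sigma_2$ of eq.(27) and the peak edges $\lambda_k + \Sigma_1 \pm \sqrt{4N(p,k)\lambda_k^2(1 - \lambda_k^{-2}\Sigma_2)/N}$ of eq.(30) for a given list of process eigenvalues and degeneracies. The function names, the zero-based indexing of the eigenvalue list and the spiked-PCA example values are assumptions made for illustration only.

```python
import math

def sigma_sums(lams, degs, k, N):
    """Sigma_1 and Sigma_2 of eq.(27) for peak k (zero-based index into lams)."""
    inv_k = 1.0 / lams[k]
    s1 = 0.0
    s2 = 0.0
    for n, (lam, deg) in enumerate(zip(lams, degs)):
        if n == k:
            continue
        delta = 1.0 / lam - inv_k  # Delta_nk = 1/lambda_n - 1/lambda_k
        s1 += deg / delta
        s2 += deg / delta ** 2
    return s1 / N, s2 / N

def peak_edges(lams, degs, k, N):
    """Edges (lambda_min, lambda_max) of the k-th peak, or None if it is not resolved."""
    s1, s2 = sigma_sums(lams, degs, k, N)
    lam = lams[k]
    inside = 4.0 * degs[k] * lam ** 2 * (1.0 - s2 / lam ** 2) / N
    if inside <= 0.0:
        return None  # lambda_k^2 <= Sigma_2: no separate peak at this sample size
    half_width = math.sqrt(inside)
    return lam + s1 - half_width, lam + s1 + half_width

if __name__ == "__main__":
    # Standard PCA example: lambda_1 = sigma^2 (1 + A) with sigma^2 = 1, A = 1, p = 10.
    lams, degs = [2.0, 1.0], [1, 9]
    print(peak_edges(lams, degs, 0, 1000))  # resolved, since N/p is well above 1/A^2
    print(peak_edges(lams, degs, 0, 5))     # None, since N/p is below 1/A^2
```

In the spiked example the top peak is centred slightly above $\lambda_1$ because $\Sigma_1 > 0$ for the largest eigenvalue, which is the upward bias discussed above.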

The validity of the expansion in (26) depends upon the neglected terms in the
expansion being considerably smaller in magnitude than those retained. Thus
we require,

$$|\delta_k| \ll 1 , \qquad |\delta_k| \ll \lambda_k|\Delta_{nk}| \quad \forall n \neq k . \qquad (31)$$

It is the second of these relations which is important and utilising the solution
(28) for $\delta_k$ we find,

$$N|\delta_k|^2 = \frac{N(p,k)}{1 - \lambda_k^{-2}\Sigma_2} . \qquad (32)$$

For large p we have,

$$N(p,n) = \frac{p^n}{n!}\left(1 + O(p^{-1})\right) , \qquad (33)$$

and for the RBF kernel,

$$\lambda_n = \exp(-b^{-2}) \exp\left(-n \ln(b^2 p)\right) \left(1 + O(p^{-1})\right) , \qquad (34)$$

from which we find,

k 

max Afc|Z\„fc| -)> 1 , = -^ x (l -h C>(p“^)) asp ^ 00 . (35) 



The localized contribution to the Gram matrix eigenvalue density given by (29)
is then valid provided $p^k/(N k!) \ll 1$ which suggests that as p becomes large,
$N = O(p^k)$ pattern vectors are required in order to resolve the $k^{th}$ population
eigenvalue as a separate peak within the observed spectra of the Gram matrix.

Let ko{N) denote the number of peaks that can be resolved for a given 
number of data points N. For the RBF kernel 4 ~ p~^ InN x 0(1). For some 
k > ko, dropping higher order terms in the expansion (26) is not valid. To obtain 
an approximation to the expected spectrum in the range $\lambda < \lambda_{k_0}$ consider that
since $\lambda_k \to 0^+$ as $k \to \infty$ then for sufficiently large k the kernel eigenvalue $\lambda_k$
will be smaller than $|R_{\hat{C}}^{-1}|$. Thus for some integer $k_c$ we expand,



N {p, k) 

h 



He 

E 






N{p,k) 



1 

^kRc^ 




— 1 oo 

+ ^AfciV(p,fc)(l 

k—kc-\-l 



Rc^^k) 



(36) 

Binomially expanding the two sums on the right-hand side of (36), retaining 
only the first term in the first sum and the first two terms in the second sum, 
and solving the resulting approximation for Rc^{z) yields a single bulk- like 
contribution (denoted by subscript B) to the density. 



1 

2^ 




A) X 0(A, ■! Amaa:,s) 



(37) 



with ^{max,min},B — T ^ 2 Z\ and, 

- oo - oo ^ kc 

7 =;^ '^N{p,k)Xk , A=— '^N{p,k)Xl , H = 1 - — '^N{p,k) . 

k—kc-\-l k—kc-\-l k—1 

(38) 

Combining (37) and (29) we obtain the approximation. 



^ 1 y ^min,k^{^max^k A) 

P(^) — 7^ / , ^ ^ 6*(A; Xmin,k'i Xmax,k) 

k=l ~ ^2 

~t“ ^ X^in^B^iXmax^B A) X 0(A, Xjnin^B ^ ^max,B ) • (39) 

It is easily confirmed that if $k_c = k_0$ then the approximate density given by (39)
is correctly normalized to leading order in $N^{-1}$, i.e. $\int d\lambda\, \rho(\lambda) = 1 + O(N^{-1})$.

We note that the above approximation (39) is not restricted to dot-product
kernels. It can be applied to other kernels and data distributions for which the
process eigenvalues and their degeneracy are known, but it is obviously limited by
the validity of the starting Stieltjes transform relationship (3).

4 Simulation Results 

In this section we compare the theory with simulation results. We consider a
Gaussian RBF kernel with data uniformly distributed on the sphere. In figure 1
we show the spectrum averaged over 1000 simulations for N data points of
dimension p = 10 with kernel width parameter $b^2 = 1$. For figure 1(a) with N = 50
only two peaks are discernible and the eigenvalues are greatly dispersed from
their limiting values, whilst for figure 1(b), with N = 100, more structure is
discernible. These are examples of a regime where the bounds developed by
Shawe-Taylor et al. would not be tight [6], since the sample and process
eigenvalues are not close. The numerical solution of the RPA result in eq.(25), shown
Fig. 1. We show the Gram matrix eigenvalue spectrum averaged over 1000 simulations
(solid line) compared to the variational mean-field theory (dashed line) obtained by
numerically solving eq.(25). Each data set was created by sampling N points uniformly
from a p = 10 dimensional sphere and a Gaussian RBF kernel with $b^2 = 1$. (a) N = 50.
(b) N = 100.



by the dotted line in both figure 1(a) and figure 1(b), provides an impressive fit 
to the simulation results in this case. 
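A simulation of this kind can be sketched in a few lines. The example below is a minimal illustration for the low-dimensional case p = 2, not the code used for the figures: it samples N points uniformly on the unit circle, builds the centred and normalized Gram matrix with entries $(k(x_i,x_j) - C)/N$ where $C = \exp(-b^{-2}) I_0(b^{-2})$, and estimates the top eigenvalue by power iteration. The normalization by N, the use of the population constant $C$ for centring, and all function names are assumptions made for this sketch.

```python
import math
import random

def bessel_i0(a, terms=60):
    """Modified Bessel function I_0(a) from its power series."""
    return sum((a / 2.0) ** (2 * m) / math.factorial(m) ** 2 for m in range(terms))

def centred_gram(thetas, b2=1.0):
    """Centred RBF Gram matrix (k - C)/N for points at angles thetas on the unit circle."""
    n = len(thetas)
    a = 1.0 / b2
    c = math.exp(-a) * bessel_i0(a)  # mean of the kernel under the uniform density
    # On the unit circle ||x - y||^2 = 2 - 2 cos(theta_i - theta_j).
    return [[(math.exp(-a * (1.0 - math.cos(ti - tj))) - c) / n for tj in thetas]
            for ti in thetas]

def top_eigenvalue(mat, iters=300, seed=0):
    """Rayleigh-quotient estimate of the largest eigenvalue by power iteration."""
    rng = random.Random(seed)
    n = len(mat)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(n)) for row in mat]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    w = [sum(row[j] * v[j] for j in range(n)) for row in mat]
    return sum(vi * wi for vi, wi in zip(v, w))

if __name__ == "__main__":
    rng = random.Random(1)
    thetas = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(50)]
    gram = centred_gram(thetas)
    print("top sample eigenvalue:", top_eigenvalue(gram))
    print("process eigenvalue lambda_1 = exp(-1) I_1(1), approximately 0.2079")
```

Averaging the top eigenvalue over many independent samples, and repeating for several N, gives the kind of convergence curve discussed in the text.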

We might expect the statistical mechanics theory to work well for high dimensional
data, but we would also like to test the theory for lower dimensionality.
In figure 2 we consider the case of p = 2 dimensional data. This is far removed
from the typical case of high-dimensional data often considered in statistical
mechanics studies of learning. In figure 2(a) we show results of simulations with
N = 50 data points. Other parameters are as in figure 1. In this case several
separate peaks are clearly visible. There is also an agglomeration of remaining
eigenvalues around $\ln\lambda \approx -40$ (not shown) corresponding approximately to the
limit of machine and algorithmic precision in the simulations. The dashed line
shows the approximate solution in eq.(39) of the RPA relationship in eq.(25). We
have set $k_c = 14$ in accordance with the number of resolved peaks which can be
visually distinguished in the simulation results. In fact the approximate solution
(39) yields a real density for $k_0 < 25$. The dotted line shows the full numerical
solution of eq.(25), which is almost indistinguishable from the approximate
solution. For the largest eigenvalues the qualitative agreement between the RPA
and simulations is good.

In figure 2(b) we show the averaged spectrum for a smaller data set of N = 25
points. All other parameters are as for figure 2(a). Clearly fewer resolved peaks
are observable in this case and the remaining bulk is considerably more dispersed
than in figure 2(a), as expected for smaller N. The validity of the perturbative
approximate RPA solution in eq.(39) is more questionable in this case and the
discrepancy between the full and approximate solutions (dashed and dotted lines
respectively) to the RPA is more discernible.

Fig. 2. We show the Gram matrix eigenvalue density averaged over 1000 simulations
(solid line) for (a) N = 50 and (b) N = 25 data points distributed uniformly on a
p = 2 dimensional sphere and Gaussian RBF with $b^2 = 1$. The dotted line shows the
full numerical solution to eq.(25), over the range $\ln\lambda \in [-26, 0]$, and the dashed line
shows the approximate analytical solution given by eq.(39).

In order to use these results for model selection it would be useful to estimate
the expectation value of the top eigenvalue. Figure 3(a) shows the convergence
of the top eigenvalue to its asymptotic value as the number of data points N
increases. We have plotted the log of the fractional error (between the top
eigenvalue and its asymptotic value, i.e. $(\langle\lambda_1\rangle/\lambda_1) - 1$) against $\ln N$. As in figure 2
we have chosen p = 2 and a Gaussian RBF kernel with $b^2 = 1$. The solid circles
show simulation results averaged over 1000 Gram matrices (error bars are of the
order of the size of the plotted symbols). Also plotted is the theoretical estimate
from eq.(30) obtained from the approximate solution of the RPA. Clearly from
the simulation results the top observed eigenvalue has an expectation value that
converges as $N^{-1/2}$ to its asymptotic value. In figure 3(b) we observe that as the
dimensionality p increases, the discrepancy between the theoretical estimate in 
eq.(30) and simulation decreases and then increases again. This is ultimately 
because the simulation results indicate that the fractional error scales as p^^'^ 
at large N, rather than ^ as suggested by eq.(30). To test the idea, suggested 
by eq.(30), that this convergence is universal for all dot-product kernels with 
an input distribution that is uniform on the sphere, we have also simulated the 
kernel $k(x \cdot y) = \cosh(a + x \cdot y)$. The constant a > 0 is to ensure that the
first eigenvalue of this even kernel function is non-vanishing for p = 2 and we 
have chosen a = 0.5. The similarity between the simulation results for the two 
different kernels is apparent and although not exact, the two sets of simulation 
results appear to converge at large N. 

5 Discussion 

We studied the averaged spectrum of the Gram matrix in kernel PCA using
methods adapted from statistical physics. We mainly focussed on a mean-field
variational approximation, the Random Phase Approximation (RPA). The RPA
was shown to reduce to an identity for the Stieltjes transform of the spectral
density that is known to be the correct asymptotic solution for PCA. We
developed an approximate analytical solution to the theory that was shown to agree
well with a numerical solution of this identity and the theory was shown to give a
good qualitative match to simulation results. The theory correctly described the
Fig. 3. (a) Log-log plot of fractional error of the top eigenvalue of the centred Gram
matrix with N. Plotted are simulation results for p = 2: Gaussian RBF kernel with
$b^2 = 1$ (solid circles), a $\cosh(\frac{1}{2} + x \cdot y)$ dot-product kernel (solid triangles) and
theoretical estimate given by eq.(30) (solid line). (b) Log-log plot of the fractional error
of the top eigenvalue of the centred Gram matrix with increasing dimensionality p and
three values of N = 128, 256 & 512. Also plotted are the theoretical estimates.



scaling of the top eigenvalue with sample size but there were systematic errors 
because the scaling with dimension was not correctly predicted and further work 
is required to develop a better approximation for the top eigenvalue. 
Abstract. We study the properties of the eigenvalues of Gram matrices
in a non-asymptotic setting. Using local Rademacher averages, we
provide data-dependent and tight bounds for their convergence towards
eigenvalues of the corresponding kernel operator. We perform these
computations in a functional analytic framework which allows us to deal
implicitly with reproducing kernel Hilbert spaces of infinite dimension. This can
have applications to various kernel algorithms, such as Support Vector
Machines (SVM). We focus on Kernel Principal Component Analysis 
(KPCA) and, using such techniques, we obtain sharp excess risk bounds 
for the reconstruction error. In these bounds, the dependence on the 
decay of the spectrum and on the closeness of successive eigenvalues is 
made explicit. 



1 Introduction 

Due to their versatility, kernel methods are currently very popular as data- 
analysis tools. In such algorithms, the key object is the so-called kernel matrix 
(the Gram matrix built on the data sample) and it turns out that its spectrum 
can be related to the performance of the algorithm. This has been shown in 
particular in the case of Support Vector Machines [19]. Studying the behav- 
ior of eigenvalues of kernel matrices, their stability and how they relate to the 
eigenvalues of the corresponding kernel integral operator is thus crucial for un- 
derstanding the statistical properties of kernel-based algorithms. 

Principal Component Analysis (PCA), and its non-linear variant, kernel-PCA,
are widely used algorithms in data analysis. They extract from the vector space
where the data lie a basis which is, in some sense, adapted to the data by
looking for directions where the variance is maximized. Their applications are very
diverse, ranging from dimensionality reduction to denoising. Applying PCA to
a space of functions rather than to a space of vectors was first proposed by Besse
[5] (see also [15] for a survey). Kernel-PCA [16] is an instance of such a method
which has boosted the interest in PCA as it makes it possible to overcome the limitations
of linear PCA in a very elegant manner.

Despite being a relatively old and commonly used technique, little has been done 
on analyzing the statistical performance of PCA. Most of the previous work has 
focused on the asymptotic behavior of empirical covariance matrices of Gaussian 
vectors (see e.g. [1]). In the non-linear setting where one uses positive definite 
kernels, there is a tight connection between the covariance and the kernel matrix 
of the data. This is actually at the heart of the kernel-PCA algorithm, but it 
also indicates that the properties of the kernel matrix, in particular its spectrum, 
play a role in the properties of the kernel-PCA algorithm. 

Recently, J. Shawe-Taylor, C. Williams, N. Cristianini and J. Kandola [17] have 
undertaken an investigation of the properties of the eigenvalues of kernel matri- 
ces and related it to the statistical performance of kernel-PCA. 

In this work, we mainly extend the results of [17]. In particular we treat the 
infinite dimensional case with more care and we refine the bounds using recent 
tools from empirical processes theory. We obtain significant improvements and 
more explicit bounds. 

The fact that some of the most interesting positive definite kernels (e.g. the Gaus- 
sian RBF kernel), generate an infinite dimensional reproducing kernel Hilbert 
space (the "feature space" into which the data is mapped), raises a technical
difficulty. We propose to tackle this difficulty by using the framework of Hilbert- 
Schmidt operators and of random vectors in Hilbert spaces. Under some reason- 
able assumptions (like separability of the RKHS and boundedness of the kernel), 
things work nicely but some background in functional analysis is needed which 
is introduced below. 

Our approach builds on ideas pioneered by Massart [13], on the fact that Tala- 
grand’s concentration inequality can be used to obtain sharp oracle inequalities 
for empirical risk minimization on a collection of function classes when the vari- 
ance of the relative error can be related to the expected relative error itself. This 
idea has been exploited further in [2] . 

The paper is organized as follows. Section 2 introduces the necessary back- 
ground on functional analysis and the basic assumptions. We then present, in 
Section 3 bounds on the difference between sums of eigenvalues of the kernel 
matrix and of the associated kernel operator. Finally, Section 4 gives our main 
results on kernel-PCA. 



2 Preliminaries 

In order to make the paper self-contained, we introduce some background, and 
give the notations for the rest of the paper. 
2.1 Background Material on Functional Analysis

Let $\mathcal{H}$ be a separable Hilbert space. A linear operator L from $\mathcal{H}$ to $\mathcal{H}$ is called
Hilbert-Schmidt if $\sum_{i \geq 1} \|L e_i\|_{\mathcal{H}}^2 < \infty$, where $(e_i)_{i \geq 1}$ is an orthonormal basis of
$\mathcal{H}$. This sum is independent of the chosen orthonormal basis and is the square
of the Hilbert-Schmidt norm of L when it is finite. The set of all Hilbert-Schmidt
operators on $\mathcal{H}$ is denoted by $HS(\mathcal{H})$. Endowed with the following inner product
$\langle L, N \rangle_{HS(\mathcal{H})} = \sum_{i,j \geq 1} \langle L e_i, e_j \rangle \langle N e_i, e_j \rangle$, it is a separable Hilbert space.

A Hilbert-Schmidt operator is compact, it has a countable spectrum and an
eigenspace associated to a non-zero eigenvalue is of finite dimension. A compact,
self-adjoint operator on a Hilbert space can be diagonalized, i.e. there exists an
orthonormal basis of $\mathcal{H}$ made of eigenfunctions of this operator. If L is a compact,
positive self-adjoint operator, $\lambda(L)$ denotes its spectrum sorted in non-increasing
order, repeated according to their multiplicities ($\lambda_1(L) \geq \lambda_2(L) \geq \ldots$). An
operator L is called trace-class if $\sum_{i \geq 1} \langle e_i, L e_i \rangle$ is a convergent series. In fact,
this series is independent of the chosen orthonormal basis and is called the trace
of L, denoted by tr L. By Lidskii's theorem $\mathrm{tr}\, L = \sum_{i \geq 1} \lambda_i(L)$.

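We will keep switching from $\mathcal{H}$ to $HS(\mathcal{H})$ and treat their elements as
vectors or as operators depending on the context, so we will need the following
identities. Denoting, for $f, g \in \mathcal{H}$, by $f \otimes g$ the rank one operator
defined as $f \otimes g(h) = \langle g, h \rangle f$, it easily follows from the above definitions that
$\|f \otimes g\|_{HS(\mathcal{H})} = \|f\|_{\mathcal{H}} \|g\|_{\mathcal{H}}$, and for $A \in HS(\mathcal{H})$,

$$\langle f \otimes g, A \rangle_{HS(\mathcal{H})} = \langle A g, f \rangle_{\mathcal{H}} . \qquad (1)$$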
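In finite dimension these identities can be checked directly, since $f \otimes g$ is the matrix $f g^{\top}$ and the Hilbert-Schmidt inner product is the entrywise (Frobenius) inner product. The following minimal sketch, with hypothetical helper names and arbitrary test vectors, verifies both the norm identity and identity (1) for real vectors.

```python
import math

def outer(f, g):
    """Matrix of the rank-one operator f (x) g, which maps h to <g, h> f."""
    return [[fi * gj for gj in g] for fi in f]

def hs_inner(a, b):
    """Hilbert-Schmidt (Frobenius) inner product of two real matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def apply(a, v):
    """Matrix-vector product a v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

if __name__ == "__main__":
    f, g = [1.0, -2.0, 0.5], [3.0, 0.0, 4.0]
    a = [[2.0, 1.0, 0.0], [0.0, -1.0, 3.0], [5.0, 0.5, 1.0]]
    rank_one = outer(f, g)
    print(math.sqrt(hs_inner(rank_one, rank_one)), math.sqrt(dot(f, f) * dot(g, g)))
    print(hs_inner(rank_one, a), dot(apply(a, g), f))
```

Each printed pair should agree up to rounding.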

We recall that an orthogonal projector in $\mathcal{H}$ is an operator U such that $U^2 = U$
and $U = U^*$ (hence positive). In particular one has $\|U(h)\|_{\mathcal{H}}^2 = \langle h, Uh \rangle_{\mathcal{H}}$. U
has rank $d < \infty$ (i.e. it is a projection on a finite dimensional subspace), if and
only if it is Hilbert-Schmidt with $\|U\|_{HS(\mathcal{H})} = \sqrt{d}$ and $\mathrm{tr}\, U = d$. In that case it
can be decomposed as $U = \sum_{i=1}^{d} \psi_i \otimes \psi_i$ where $(\psi_i)_{i=1}^{d}$ is an orthonormal basis
of the image of U.

If V denotes a closed subspace of $\mathcal{H}$, we denote by $\Pi_V$ the unique orthogonal
projector such that $\mathrm{ran}\, \Pi_V = V$ and $\ker \Pi_V = V^{\perp}$. When V is of finite
dimension, $\Pi_{V^{\perp}}$ is not Hilbert-Schmidt, but we will denote, for a trace-class operator
A, $\langle \Pi_{V^{\perp}}, A \rangle = \mathrm{tr}\, A - \langle \Pi_V, A \rangle_{HS(\mathcal{H})}$ with some abuse of notation.

2.2 Kernel and Covariance Operators 

We recall basic facts about random elements in Hilbert spaces. A random element
Z in a separable Hilbert space has an expectation $e \in \mathcal{H}$ when $E\|Z\| < \infty$ and
e is the unique vector satisfying $\langle e, f \rangle_{\mathcal{H}} = E\langle Z, f \rangle_{\mathcal{H}}$, $\forall f \in \mathcal{H}$. Moreover, when
$E\|Z\|^2 < \infty$, there exists a unique operator $C : \mathcal{H} \to \mathcal{H}$ such that $\langle f, Cg \rangle_{\mathcal{H}} =
E[\langle f, Z \rangle_{\mathcal{H}} \langle g, Z \rangle_{\mathcal{H}}]$, $\forall f, g \in \mathcal{H}$. C is called the covariance operator of Z and is a
self-adjoint, positive, trace-class operator, with $\mathrm{tr}\, C = E\|Z\|^2$ (see e.g. [4]).

The core property of kernel operators that we will use is their intimate
relationship with a covariance operator, which is summarized in the next theorem. This
property was first used in a similar but more restrictive context (finite
dimensional) by Shawe-Taylor, Williams, Cristianini and Kandola [17].

Theorem 1. Let $(\mathcal{X}, P)$ be a probability space, $\mathcal{H}$ be a separable Hilbert space
and $\Phi$ be a map from $\mathcal{X}$ to $\mathcal{H}$ such that for all $h \in \mathcal{H}$, $\langle h, \Phi(\cdot) \rangle_{\mathcal{H}}$ is measurable
and $E\|\Phi(X)\|^2 < \infty$. Let C be the covariance operator associated to $\Phi(X)$ and
$K : L_2(P) \to L_2(P)$ be the integral operator defined as

$$(Kf)(x) = \int f(y) \langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}}\, dP(y) .$$

Then $\lambda(K) = \lambda(C)$.

In particular, K is a positive self-adjoint trace-class operator and $\mathrm{tr}(K) =
E\|\Phi(X)\|^2 = \sum_{i \geq 1} \lambda_i(K)$.



2.3 Eigenvalues Formula 

We denote by $\mathcal{V}_d$ the set of subspaces of dimension d of $\mathcal{H}$. The following theorem,
whose proof can be found in [18], gives a useful formula to compute sums of
eigenvalues.

Theorem 2 (Fan). Let C be a compact self-adjoint operator on $\mathcal{H}$, then

$$\sum_{i=1}^{d} \lambda_i(C) = \max_{V \in \mathcal{V}_d} \langle \Pi_V, C \rangle_{HS(\mathcal{H})} ,$$

and the maximum is reached when V is the space spanned by the first d
eigenvectors of C.
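The variational formula can be illustrated numerically in finite dimension. The sketch below is a hedged example with hypothetical names: it takes a diagonal matrix C, draws random d-dimensional subspaces via Gram-Schmidt, and evaluates $\langle \Pi_V, C \rangle = \sum_i \langle v_i, C v_i \rangle$ for an orthonormal basis $(v_i)$ of V. No random subspace should exceed the sum of the d largest eigenvalues, and the span of the leading coordinate vectors attains it.

```python
import random

def orthonormal_basis(dim, d, rng):
    """Orthonormal basis of a random d-dimensional subspace of R^dim (Gram-Schmidt)."""
    basis = []
    while len(basis) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            c = sum(x * y for x, y in zip(v, b))
            v = [x - c * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-10:
            basis.append([x / norm for x in v])
    return basis

def projected_trace(eigs, basis):
    """<Pi_V, C> for C = diag(eigs) and V spanned by the orthonormal vectors in basis."""
    return sum(sum(lam * x * x for lam, x in zip(eigs, v)) for v in basis)

if __name__ == "__main__":
    eigs = [5.0, 3.0, 2.0, 1.0, 0.5]  # eigenvalues of C in non-increasing order
    d = 2
    rng = random.Random(0)
    best_random = max(projected_trace(eigs, orthonormal_basis(len(eigs), d, rng))
                      for _ in range(1000))
    leading = [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0]]
    print(best_random, projected_trace(eigs, leading), sum(eigs[:d]))
```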

We will also need the following formula for single eigenvalues. 

Theorem 3 (Courant-Fischer-Weyl, see e.g. [9]). Let C be a compact
self-adjoint operator on $\mathcal{H}$, then for all $d \geq 1$,

$$\lambda_d(C) = \min_{V \in \mathcal{V}_{d-1}} \max_{f \perp V} \frac{\langle f, Cf \rangle}{\|f\|^2} ,$$

where the minimum is attained when V is the span of the first d − 1 eigenvectors
of C.

2.4 Assumptions and Basic Facts 

Let $\mathcal{X}$ denote the input space (an arbitrary measurable space) and P denote a
distribution on $\mathcal{X}$ according to which the data is sampled i.i.d.

We will denote by $P_n$ the empirical measure associated to a sample $X_1, \ldots, X_n$
from P, i.e. $P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$. With some abuse of notation, for a function
$f : \mathcal{X} \to \mathbb{R}$, we may use the notation $Pf := E[f(X)]$ and $P_n f := \frac{1}{n} \sum_{i=1}^{n} f(X_i)$.
Also, $\varepsilon_1, \ldots, \varepsilon_n$ will denote a sequence of Rademacher random variables (i.e.
independent with value +1 or −1 with probability 1/2).

Let k be a positive definite function on $\mathcal{X}$ and $\mathcal{H}_k$ the associated
reproducing kernel Hilbert space. They are related by the reproducing property:
$\forall f \in \mathcal{H}_k$, $\forall x \in \mathcal{X}$, $\langle f, k(x, \cdot) \rangle_{\mathcal{H}_k} = f(x)$. We denote by $\mathcal{V}_d$ the set of all
vector subspaces of dimension d of $\mathcal{H}_k$.
We will always work with the following assumption.

Assumption 1. We assume that

— For all $x \in \mathcal{X}$, $k(x, \cdot)$ is P-measurable.

— There exists M > 0 such that $k(X, X) \leq M$ P-almost surely.

— $\mathcal{H}_k$ is separable.

For $x \in \mathcal{X}$, we denote $\varphi_x = k(x, \cdot)$ understood as an element of $\mathcal{H}_k$.

Let $C_x$ be the operator defined on $\mathcal{H}_k$ by

$$\langle f, C_x g \rangle_{\mathcal{H}_k} = f(x) g(x) .$$

It is easy to see that $C_x = \varphi_x \otimes \varphi_x$ and $C_x$ is trace-class with $\mathrm{tr}\, C_x = k(x,x)$
and $\|C_x\|_{HS(\mathcal{H}_k)}^2 = k(x,x)^2$.

Also, from the definitions and by (1) we have for example $\langle C_x, C_y \rangle_{HS(\mathcal{H}_k)} =
k^2(x,y)$ and, for any projector U, $\|U \varphi_x\|_{\mathcal{H}_k}^2 = \langle U, C_x \rangle_{HS(\mathcal{H}_k)}$.

We will denote by $C_1 : \mathcal{H}_k \to \mathcal{H}_k$ (resp. $C_2 : HS(\mathcal{H}_k) \to HS(\mathcal{H}_k)$) the covariance
operator associated to the random element $\varphi_X$ in $\mathcal{H}_k$ (resp. $C_X$ in $HS(\mathcal{H}_k)$). Also,
let $K_1$ (resp. $K_2$) be the integral operator with kernel $k(x,y)$ (resp. $k(x,y)^2$).

Lemma 1. Under Assumption 1 the operators $C_1, C_2, K_1, K_2$ defined above are
trace-class with $\mathrm{tr}\, C_1 = E[k(X, X)]$, $\mathrm{tr}\, C_2 = E[k^2(X, X)]$. They satisfy the
following properties

(i) $\lambda(C_1) = \lambda(K_1)$ and $\lambda(C_2) = \lambda(K_2)$.

(ii) $C_1$ is the expectation in $HS(\mathcal{H}_k)$ of $C_X$.

(iii) $C_2$ is the expectation in $HS(HS(\mathcal{H}_k))$ of $C_X \otimes C_X$.

Proof. (i) To begin with, we prove that $\mathrm{tr}\, C_1 = E k(X,X)$ and $\lambda(C_1) = \lambda(K_1)$
by applying Theorem 1 with $\Phi(x) = \varphi_x$: since $k(x, \cdot)$ is measurable, all linear
combinations and pointwise limits of such combinations are measurable, so that
all the functions in $\mathcal{H}_k$ are measurable. Hence measurability, for $h \in \mathcal{H}_k$, of
$x \mapsto \langle \varphi_x, h \rangle_{\mathcal{H}_k}$ follows and we have $E\|\varphi_X\|^2 = E k(X, X) < \infty$.

Then, we prove that $\mathrm{tr}\, C_2 = E k^2(X, X)$ and $\lambda(C_2) = \lambda(K_2)$ by applying
Theorem 1 with $\Phi(x) = C_x$: for $h \in HS(\mathcal{H}_k)$ with finite rank (i.e. $h = \sum_{i=1}^{n} \phi_i \otimes \psi_i$
for an orthonormal set $\phi_i$ and $\psi_i = h^* \phi_i$), the function $x \mapsto \langle C_x, h \rangle_{HS(\mathcal{H}_k)} =
\sum_{i=1}^{n} \phi_i(x) \psi_i(x)$ is measurable (since $\phi_i$ and $\psi_i$ are measurable as elements
of $\mathcal{H}_k$). Moreover, since the finite rank operators are dense in $HS(\mathcal{H}_k)$ and
$h \mapsto \langle C_x, h \rangle_{HS(\mathcal{H}_k)}$ is continuous, we have measurability for all $h \in HS(\mathcal{H}_k)$.

Finally, we have $E\|C_X\|_{HS(\mathcal{H}_k)}^2 = E k^2(X, X) < \infty$.

(ii) Since $E\|C_X\|_{HS(\mathcal{H}_k)} = E k(X, X) < \infty$ the expectation of $C_X$ is well
defined in $HS(\mathcal{H}_k)$. Moreover for all $f, g \in \mathcal{H}_k$, $\langle E C_X f, g \rangle = \langle E C_X, g \otimes f \rangle :=
E \langle C_X, g \otimes f \rangle = E \langle C_X f, g \rangle = E f(X) g(X) = \langle C_1 f, g \rangle$.

(iii) Using $\|C_X \otimes C_X\|_{HS(HS(\mathcal{H}_k))} = \|C_X\|_{HS(\mathcal{H}_k)}^2 = k(X,X)^2$, a similar
argument gives the last statement. □

The generality of the above results implies that we can replace the
distribution P by the empirical measure $P_n$ associated to an i.i.d. sample $X_1, \ldots, X_n$
without any changes. If we do so, the associated operators are denoted by $K_{1,n}$
(which is identified [12] with the normalized kernel matrix of size $n \times n$, $K_{1,n} =
(k(X_i, X_j)/n)_{i,j=1,\ldots,n}$) and $C_{1,n}$ which is the empirical covariance operator
(i.e. $\langle f, C_{1,n}\, g \rangle = \frac{1}{n} \sum_{i=1}^{n} f(X_i) g(X_i)$). We can also define $K_{2,n}$ and $C_{2,n}$
similarly. In particular, Theorem 1 implies that $\lambda(K_{1,n}) = \lambda(C_{1,n})$ and $\lambda(K_{2,n}) =
\lambda(C_{2,n})$ and $\mathrm{tr}\, K_{1,n} = \mathrm{tr}\, C_{1,n} = \frac{1}{n} \sum_{i=1}^{n} k(X_i, X_i)$ while $\mathrm{tr}\, K_{2,n} = \mathrm{tr}\, C_{2,n} =
\frac{1}{n} \sum_{i=1}^{n} k^2(X_i, X_i)$.
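For a kernel with an explicit finite-dimensional feature map, the correspondence between the normalized kernel matrix and the empirical covariance operator can be checked directly. The sketch below is illustrative only: the feature map `phi` is a hypothetical choice, and the check compares the traces and the traces of the squares of the two matrices, which must agree because the matrices share their non-zero eigenvalues.

```python
import random

def phi(x):
    """Hypothetical explicit feature map R^2 -> R^3, so that k(x, y) = <phi(x), phi(y)>."""
    return [x[0], x[1], x[0] * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel_matrix(points):
    """Normalized kernel matrix K_{1,n} with entries k(X_i, X_j)/n."""
    n = len(points)
    feats = [phi(x) for x in points]
    return [[dot(fi, fj) / n for fj in feats] for fi in feats]

def covariance_matrix(points):
    """Empirical (uncentred) covariance C_{1,n} = (1/n) sum_i phi(X_i) phi(X_i)^T."""
    n = len(points)
    feats = [phi(x) for x in points]
    m = len(feats[0])
    return [[sum(f[a] * f[b] for f in feats) / n for b in range(m)] for a in range(m)]

def trace(mat):
    return sum(mat[i][i] for i in range(len(mat)))

def trace_of_square(mat):
    """tr(A^2) for a symmetric matrix A, i.e. the sum of its squared eigenvalues."""
    return sum(x * x for row in mat for x in row)

if __name__ == "__main__":
    rng = random.Random(0)
    pts = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20)]
    k_mat, c_mat = kernel_matrix(pts), covariance_matrix(pts)
    print(trace(k_mat), trace(c_mat))
    print(trace_of_square(k_mat), trace_of_square(c_mat))
```

The kernel matrix is 20 × 20 and the covariance matrix is 3 × 3, yet both pairs of printed values coincide up to rounding.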

3 General Results on Eigenvalues of Gram Matrices 

We first relate sums of eigenvalues to a class of functions of type $x \mapsto \langle \Pi_V, C_x \rangle$.
This will allow us to introduce classical tools of Empirical Processes Theory to
study the relationship between eigenvalues of the empirical Gram matrix and of
the corresponding integral operator.

Corollary 1. Under Assumption 1, we have

$$\sum_{k=1}^{d} \lambda_k(K_1) = \max_{V \in \mathcal{V}_d} E[\langle \Pi_V, C_X \rangle] \quad \text{and} \quad \sum_{k \geq d+1} \lambda_k(K_1) = \min_{V \in \mathcal{V}_d} E[\langle \Pi_{V^{\perp}}, C_X \rangle] .$$

Proof. The result for the sums of the largest eigenvalues follows from Theorem
2 applied to $C_1$ and Lemma 1. For the smallest ones, we use the fact that
$\mathrm{tr}\, C_1 = E\, \mathrm{tr}\, C_X = \sum_{k \geq 1} \lambda_k(C_1)$, and $\langle \Pi_V, C_x \rangle + \langle \Pi_{V^{\perp}}, C_x \rangle = \mathrm{tr}\, C_x$. □

Notice that similar results hold for the empirical versions (replacing P by $P_n$).



3.1 Global Approach 

In this section, we obtain concentration results for the sum of the largest
eigenvalues, and for the sum of the smallest ones, towards the corresponding sums of
eigenvalues of the integral operator. We start with an upper bound on the
Rademacher averages of the corresponding classes of functions.

Lemma 2. 



Ep 



- sup Vsj {nyj_,Cxj) 



= Ep 






< 



tr Ao 



n 



Proof. We use the symmetry of $\varepsilon_i$, Theorem 8 with $r \to \infty$ and h = 0, and
Lemma 1. □

We now give the main result of this section, which consists in data-dependent
upper and lower bounds for the largest and smallest eigenvalues.

Theorem 4. Under Assumption 1, with probability at least $1 - 3e^{-\xi}$,



Also, with probability at least $1 - 3e^{-\xi}$,



^ ^ A,(Xi,„)<2^-trX2.„ + 3M^|-. (3) 

Proof. We start with the first statement. Recall that

$$\sum_{i=1}^{d} \lambda_i(K_{1,n}) - \sum_{i=1}^{d} \lambda_i(K_1) = \max_{V \in \mathcal{V}_d} \langle \Pi_V, C_{1,n} \rangle - \max_{V \in \mathcal{V}_d} \langle \Pi_V, C_1 \rangle .$$

This gives, denoting by $V_d$ the subspace attaining the second maximum,

$$(P_n - P) \langle \Pi_{V_d}, C_X \rangle \leq \sum_{i=1}^{d} \lambda_i(K_{1,n}) - \sum_{i=1}^{d} \lambda_i(K_1) \leq \sup_{V \in \mathcal{V}_d} (P_n - P) \langle \Pi_V, C_X \rangle .$$

To prove the upper bound, we use McDiarmid's inequality and symmetrization
as in [3] along with the fact that, for a projector U, $\langle U, C_x \rangle \leq \|\varphi_x\|^2 \leq M$. We
conclude the proof by using Lemma 2. The lower bound is a simple consequence
of Hoeffding's inequality [10]. The second statement can be proved via similar
arguments. □

It is important to notice that the upper and lower bounds are different. To 
explain this, following the approach of [17] where McDiarmid’s inequality is 
applied to directly^, we have with probability at least 1 — e~^, 



2=1 






2=1 



2n 



Then by Jensen’s inequality, symmetrization and Lemma 2 we get 



0 < E 



■ d 

Y^HKl.n) 

. 2=1 



d 

y;Ai(77i)<E 

2=1 



sup (P„ 
veVd 



P) (nv,Cx) 




tr K2 ■ 



We see that the empirical eigenvalues are biased estimators of the population 
ones whence the difference between upper and lower bound in (2). Note that 
applying McDiarmid’s inequality again would have given precisely (2), but we 
prefer to use the approach of the proof of Theorem 4 as it can be further refined 
(see next section). 

^ Note that one could actually apply the inequality of [7] to this quantity to obtain a 
sharper bound. This is in the spirit of next section. 
3.2 Local Approach

We now use recent work based on Talagrand's inequality (see e.g. [13,2]) to
obtain better concentration for the large eigenvalues of the Gram matrix. We
obtain a better rate of convergence, but at the price of comparing the sums of
eigenvalues up to a constant factor.

Theorem 5. Under Assumption 1, for all $\alpha > 0$ and $\xi > 0$, with probability at
least $1 - e^{-\xi}$,

E A.(Kr,„)- (l + «) MK^) < + + + + (4) 

n 

k=l k=l 

where 



r*d < inf 



Mh 



h>o n 






j>h 



Moreover, with probability at least 1 — e for all a G (0, 1), 



E - (1 - «) E ^ • (5) 

k=l k=l 

Notice that the complexity term obtained here is always better than the one
of (2) (take h = 0). As an example of how this bound differs from (2), assume
that $\lambda_j(K_2) = O(j^{-\alpha})$ with $\alpha > 1$, then (2) gives a bound of order $\sqrt{d/n}$, while
the above Theorem gives a bound of order $d^{1/(1+\alpha)} / n^{\alpha/(1+\alpha)}$, which is better. In
the case of an exponential decay ($\lambda_j(K_2) = O(e^{-\gamma j})$ with $\gamma > 0$), the rate even
drops to $\log(nd)/n$.
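The trade-off behind these rates can be explored numerically. The sketch below assumes that the complexity term has the form $M h/n + \sqrt{(d/n) \sum_{j>h} \lambda_j(K_2)}$ minimized over the integer cut-off $h \geq 0$; absolute constants are omitted, the infinite eigenvalue sequence is truncated, and all names are hypothetical. It is meant only to show how a fast-decaying spectrum lets a non-zero cut-off beat the global choice h = 0.

```python
import math

def complexity_term(eigs, d, n, big_m, h):
    """Assumed form M*h/n + sqrt((d/n) * sum_{j>h} lambda_j); eigs[0] is lambda_1."""
    tail = sum(eigs[h:])
    return big_m * h / n + math.sqrt(d * tail / n)

def best_cutoff(eigs, d, n, big_m):
    """Minimize the assumed complexity term over h = 0, ..., len(eigs)."""
    tail = sum(eigs)
    best_h, best_val = 0, math.sqrt(d * tail / n)
    for h in range(1, len(eigs) + 1):
        tail = max(tail - eigs[h - 1], 0.0)  # running tail sum over j > h
        val = big_m * h / n + math.sqrt(d * tail / n)
        if val < best_val:
            best_h, best_val = h, val
    return best_h, best_val

if __name__ == "__main__":
    d, n, big_m = 5, 1000, 1.0
    poly = [j ** -2.0 for j in range(1, 2001)]        # polynomial decay, alpha = 2
    expo = [math.exp(-0.5 * j) for j in range(1, 2001)]  # exponential decay
    for name, eigs in (("polynomial", poly), ("exponential", expo)):
        h0 = complexity_term(eigs, d, n, big_m, 0)
        print(name, "h = 0:", h0, "best (h, value):", best_cutoff(eigs, d, n, big_m))
```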



4 Application to Kernel-PCA 

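We wish to find the linear space of dimension d that conserves the maximal
variance, i.e. which minimizes the error of approximating the data by their
projections,

$$V_n = \arg\min_{V \in \mathcal{V}_d} \frac{1}{n} \sum_{j=1}^{n} \|\varphi_{X_j} - \Pi_V(\varphi_{X_j})\|^2 .$$

$V_n$ is the vector space spanned by the first d eigenfunctions of $C_{1,n}$. Analogously,
we denote by $V_d$ the space spanned by the first d eigenfunctions of $C_1$. We will
adopt the following notation:

$$R_n(V) = \frac{1}{n} \sum_{j=1}^{n} \|\varphi_{X_j} - \Pi_V(\varphi_{X_j})\|^2 = P_n \langle \Pi_{V^{\perp}}, C_X \rangle ,$$

$$R(V) = E[\|\varphi_X - \Pi_V(\varphi_X)\|^2] = P \langle \Pi_{V^{\perp}}, C_X \rangle .$$

One has $R_n(V_n) = \sum_{i > d} \lambda_i(K_{1,n})$ and $R(V_d) = \sum_{i > d} \lambda_i(K_1)$.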
4.1 Bound on the Reconstruction Error

We give a data dependent bound for the reconstruction error. 

Theorem 6. Under Assumption 1, with probability at least $1 - 2e^{-\xi}$,

R{Vn)< + 2^ ^ tr K2,n + . 

i=^d-\-l 

Proof. We have 

R{V^) - Rn{V^) = {P - Pn) (ny^,Cx) < SUp (P - P„) (ilyr , Cjf) ■ 

" veVd 

We have already treated this quantity in the proof of Theorem 4. □ 

In order to compare the global and the local approach, we give a theoretical 
bound on the reconstruction error. By definition of Vn, we have R(Vn) — R{Vd) < 
2supygy^(i? — Rn){Vn) so that from the proof of Theorem 4 one gets 

R{Vn) - R{Vd) < 4^ ^tr{K2) + . 



4.2 Relative Bound 



We now show that when the eigenvalues of the kernel operator are well sepa- 
rated, estimation becomes easier in the sense that the excess error of the best 
empirical d-dimensional subspace over the error of the best d-dimensional sub- 
space can decay at a much faster rate. 

The following lemma captures the key property which allows this rate improve- 
ment. 



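Lemma 3. For any subspace $V \subset \mathcal{H}_k$,

$$\operatorname{Var}\bigl[ \langle \Pi_{V^\perp}, C_X \rangle - \langle \Pi_{V_d^\perp}, C_X \rangle \bigr] \le E\bigl[ \langle \Pi_{V^\perp} - \Pi_{V_d^\perp}, C_X \rangle^2 \bigr],$$

and for all $V \in \mathcal{V}_d$, with $\lambda_d(C_1) > \lambda_{d+1}(C_1)$,

$$E\bigl[ \langle \Pi_{V^\perp} - \Pi_{V_d^\perp}, C_X \rangle^2 \bigr] \le \frac{2\sqrt{E\,k^4(X, X')}}{\lambda_d(C_1) - \lambda_{d+1}(C_1)} \; E\bigl[ \langle \Pi_{V^\perp} - \Pi_{V_d^\perp}, C_X \rangle \bigr], \qquad (6)$$

where $X'$ is an independent copy of $X$.

Here is the main result of the section.

Theorem 7. Under Assumption 1, for all $d$ such that $\lambda_d(C_1) > \lambda_{d+1}(C_1)$, for
all $\xi > 0$, with probability at least $1 - e^{-\xi}$,

$$R(V_n) - R(V_d) \le 705 \inf_{h \ge 0} \left\{ \frac{B_d h}{n} + 4\sqrt{\frac{d}{n} \sum_{j \ge h+1} \lambda_j(K_2)} \right\} + \frac{\xi (22M + 27 B_d)}{n},$$

where $B_d = 2\sqrt{E\,k^4(X, X')} \,/\, (\lambda_d(C_1) - \lambda_{d+1}(C_1))$.

It is easy to see that the term $\sqrt{E\,k^4(X, X')}$ is upper bounded by $E\,k^2(X, X)$.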

Similarly to the observation after Theorem 5, the complexity term obtained here 
will decay faster than the one of Theorem 6, at a rate which will depend on the 
rate of decay of the eigenvalues. 

5 Discussion 

Dauxois and Pousse [8] studied asymptotic convergence of PCA and proved al- 
most sure convergence in operator norm of the empirical covariance operator to 
the population one. These results were further extended to PCA in a Hilbert 
space by [6]. However, no finite sample bounds were presented. 

Compared to the work of [12] and [11], we are interested in non-asymptotic (i.e. 
finite sample sizes) results. Also, as we are only interested in the case where 
k{x,y) is a positive definite function, we have the nice property of Theorem 1 
which allows us to consider the empirical operator and its limit as acting on the
same space (since we can use covariance operators on the RKHS) . This is crucial 
in our analysis and makes precise non-asymptotic computations possible unlike 
in the general case studied in [12,11]. 

Comparing with [17], we overcome the difficulties coming from infinite dimen- 
sional feature spaces as well as those of dealing with kernel operators (of infinite 
rank). Moreover their approach for eigenvalues is based on the concentration 
around the mean of the empirical eigenvalues and on the relationship between 
the expectation of the empirical eigenvalues and the operator eigenvalues. But 
they do not provide two-sided inequalities and they do not introduce Rademacher 
averages which are natural to measure such a difference. Here we use a direct ap- 
proach and provide two-sided inequalities with empirical complexity terms and 
even get refinements. Also, when they provide bounds for KPCA, they use a 
very rough estimate based on the fact that the functional is linear in the feature 
space associated to $k^2$. Here we provide more explicit and tighter bounds with
a global approach. Moreover, when comparing the expected residual of the em- 
pirical minimizer and the ideal one, we exploit a subtle property to get tighter 
results when the gap between eigenvalues is non-zero. 

6 Conclusion 

We have obtained sharp bounds on the behavior of sums of eigenvalues of Gram 
matrices and shown how this entails excess risk bounds for kernel-PCA. In par- 
ticular our bounds exhibit a fast rate behavior in the case where the spectrum of 
the kernel operator decays fast and contains a gap. These results significantly im- 
prove previous results of [17]. The formalism of Hilbert-Schmidt operator spaces 
over a RKHS turns out to be very well suited to a mathematically rigorous treat- 
ment of the problem, also providing compact proofs of the results. We plan to 
investigate further the application of the techniques introduced here to the study 
of other properties of kernel matrices, such as the behavior of single eigenvalues 
instead of sums, or eigenfunctions. This would provide a non-asymptotic version 
of results like those of [1] and of [6]. 







Acknowledgements. The authors are extremely grateful to Stephane 
Boucheron for invaluable comments and ideas, as well as for motivating this 
work. 



References 

1. T. W. Anderson. Asymptotic theory for principal component analysis. Ann. Math.
Stat., 34:122-148, 1963.

2. P. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher complexities,
2003. Submitted, available at
http://www.kyb.mpg.de/publications/pss/ps2000.ps.

3. P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk
bounds and structural results. Journal of Machine Learning Research, 3:463-482,
2002.

4. P. Baxendale. Gaussian measures on function spaces. Amer. J. Math., 98:891-952,
1976.

5. P. Besse. Etude descriptive d’un processus; approximation, interpolation. PhD
thesis, Universite de Toulouse, 1979.

6. P. Besse. Approximation spline de l’analyse en composantes principales d’une
variable aleatoire hilbertienne. Ann. Fac. Sci. Toulouse (Math.), 12(5):329-349,
1991.

7. S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with
applications. Random Structures and Algorithms, 16:277-292, 2000.

8. J. Dauxois and A. Pousse. Les analyses factorielles en calcul des probabilites et en
statistique: essai d’etude synthetique. PhD thesis.

9. N. Dunford and J. T. Schwartz. Linear Operators Part II: Spectral Theory, Self
Adjoint Operators in Hilbert Space. Number VII in Pure and Applied Mathematics.
John Wiley & Sons, New York, 1963.

10. W. Hoeffding. Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association, 58:13-30, 1963.

11. V. Koltchinskii. Asymptotics of spectral projections of some random matrices
approximating integral operators. Progress in Probability, 43:191-227, 1998.

12. V. Koltchinskii and E. Gine. Random matrix approximation of spectra of integral
operators. Bernoulli, 6(1):113-167, 2000.

13. P. Massart. Some applications of concentration inequalities to statistics. Annales
de la Faculte des Sciences de Toulouse, IX:245-303, 2000.

14. S. Mendelson. Estimating the performance of kernel classes. Journal of Machine
Learning Research, 4:759-771, 2003.

15. J. O. Ramsay and G. J. Dalzell. Some tools for functional data analysis. Journal
of the Royal Statistical Society, Series B, 53(3):539-572, 1991.

16. B. Scholkopf, A. J. Smola, and K.-R. Muller. Kernel principal component analysis.
In B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel
Methods - Support Vector Learning, pages 327-352. MIT Press, Cambridge, MA,
1999. Short version appeared in Neural Computation 10:1299-1319, 1998.

17. J. Shawe-Taylor, C. Williams, N. Cristianini, and J. Kandola. Eigenspectrum of
the Gram matrix and its relationship to the operator eigenspectrum. In Algorithmic
Learning Theory: 13th International Conference, ALT 2002, volume 2533 of Lecture
Notes in Computer Science, pages 23-40. Springer-Verlag, 2002. Extended version
available at http://www.support-vector.net/papers/eigenspectrum.pdf.

18. M. Torki. Etude de la sensibilite de toutes les valeurs propres non nulles d’un
operateur compact autoadjoint. Technical Report LAO97-05, Universite Paul
Sabatier, 1997. Available at http://mip.ups-tlse.fr/publi/rappLAO/97.05.ps.gz.

19. R. C. Williamson, J. Shawe-Taylor, B. Scholkopf, and A. J. Smola. Sample-based
generalization bounds. IEEE Transactions on Information Theory, 1999. Submitted.
Also: NeuroCOLT Technical Report NC-TR-99-055.



A Localized Rademacher Averages on Ellipsoids 



We give a bound on Rademacher averages of ellipsoids intersected with balls 
using a method introduced by Dudley. 

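Theorem 8. Let $\mathcal{H}$ be a separable Hilbert space and $Z$ be a random variable
with values in $\mathcal{H}$. Assume $E[\|Z\|^2] < \infty$, and let $C$ be the covariance operator
of $Z$. For an i.i.d. sample^ $Z_1, \ldots, Z_n$, denote by $C_n$ the associated empirical
covariance operator. Let $B_a = \{\|v\| \le a\}$, $\mathcal{E}_r = \{\langle v, Cv \rangle \le r\}$ and $\mathcal{E}_{n,r} =
\{\langle v, C_n v \rangle \le r\}$. We have

$$E_\varepsilon \left[ \sup_{v \in B_a \cap \mathcal{E}_{n,r}} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \langle v, Z_i \rangle \right] \le \frac{1}{\sqrt{n}} \inf_{0 \le h \le n} \left\{ \sqrt{hr} + a \sqrt{\sum_{j=h+1}^{n} \lambda_j(C_n)} \right\}, \qquad (7)$$

and

$$E \left[ \sup_{v \in B_a \cap \mathcal{E}_r} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \langle v, Z_i \rangle \right] \le \frac{1}{\sqrt{n}} \inf_{0 \le h} \left\{ \sqrt{hr} + a \sqrt{\sum_{j \ge h+1} \lambda_j(C)} \right\}. \qquad (8)$$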



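Proof. We will only prove (8), the same argument gives (7). Let $(\Phi_i)_{i \ge 1}$ be an
orthonormal basis of $\mathcal{H}$ of eigenvectors of $C$. Define $p = \min\{i : \lambda_i(C) = 0\}$. If
we prove the result for $h < p$ we are done, so we assume $h < p$. For $v \in B_a \cap \mathcal{E}_r$,
we have

$$\sum_{i=1}^{n} \varepsilon_i \langle v, Z_i \rangle = \Bigl\langle \sum_{j \le h} \langle v, \Phi_j \rangle \Phi_j, \sum_{i=1}^{n} \varepsilon_i Z_i \Bigr\rangle + \Bigl\langle \sum_{j > h} \langle v, \Phi_j \rangle \Phi_j, \sum_{i=1}^{n} \varepsilon_i Z_i \Bigr\rangle$$

$$\le \sqrt{ r \sum_{j \le h} \frac{1}{\lambda_j(C)} \Bigl\langle \sum_{i=1}^{n} \varepsilon_i Z_i, \Phi_j \Bigr\rangle^2 } + a \sqrt{ \sum_{j \ge h+1} \Bigl\langle \sum_{i=1}^{n} \varepsilon_i Z_i, \Phi_j \Bigr\rangle^2 },$$

where we used Cauchy-Schwarz inequality and $\langle v, Cv \rangle = \sum_{i \ge 1} \lambda_i(C) \langle v, \Phi_i \rangle^2$.
Moreover

$$\frac{1}{n} E\, E_\varepsilon \Bigl\langle \sum_{i=1}^{n} \varepsilon_i Z_i, \Phi_j \Bigr\rangle^2 = E\bigl[ \langle Z, \Phi_j \rangle^2 \bigr] = \langle \Phi_j, C \Phi_j \rangle = \lambda_j(C) .$$

We finally obtain (8) by Jensen’s inequality. □

^ The result also holds if the $Z_i$ are not independent but have the same distribution.

Notice that Mendelson [14] shows that these upper bounds cannot be improved.
We also need the following lemma. Recall that a sub-root function [2] is a
non-decreasing non-negative function $\psi$ on $[0, \infty)$ such that $\psi(x)/\sqrt{x}$ is
non-increasing.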

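Lemma 4. Under the conditions of Theorem 8, denoting by $\psi$ the function

$$\psi : r \mapsto \frac{1}{\sqrt{n}} \inf_{h \ge 0} \left\{ \sqrt{hr} + a \sqrt{\sum_{j > h} \lambda_j(C)} \right\},$$

we have that $\psi$ is a sub-root function and the unique positive solution $r^*$ of
$\psi(r) = r/c$ where $c > 0$ satisfies

$$r^* \le \inf_{h \ge 0} \left\{ \frac{c^2 h}{n} + 2ac \sqrt{\frac{1}{n} \sum_{j > h} \lambda_j(C)} \right\}.$$

Proof. It is easy to see that the minimum of two sub-root functions is sub-root,
hence $\psi$, as the minimum of a collection of sub-root functions, is sub-root.
Existence and uniqueness of a solution is proved in [2]. To compute it, we use
the fact that $x \le A\sqrt{x} + B$ implies $x \le A^2 + 2B$.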

We finish this section with two corollaries of Theorem 8 and Lemma 4. 

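Corollary 2. Define $W_d = \{ V \in \mathcal{V}_d : E \langle \Pi_{V^\perp} - \Pi_{V_d^\perp}, C_X \rangle^2 \le r \}$, then

$$E \left[ \sup_{V \in W_d} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \langle \Pi_{V^\perp} - \Pi_{V_d^\perp}, C_{X_i} \rangle \right] \le \frac{1}{\sqrt{n}} \inf_{h \ge 0} \left\{ \sqrt{rh} + 2 \sqrt{d \sum_{j > h} \lambda_j(K_2)} \right\}.$$

Proof. This is a consequence of Theorem 8 since $\| \Pi_{V^\perp} - \Pi_{V_d^\perp} \|^2 \le 4d$, so
that for $V \in W_d$, $\Pi_{V^\perp} - \Pi_{V_d^\perp} \in B_{2\sqrt{d}} \cap \mathcal{E}_r$ with $\mathcal{E}_r = \{ v \in HS(\mathcal{H}_k) : \langle v, C_2 v \rangle_{HS(\mathcal{H}_k)} \le r \}$.

Corollary 3. Define $W_d = \{ V \in \mathcal{V}_d : E \langle \Pi_V, C_X \rangle^2 \le r \}$, then

$$E \left[ \sup_{V \in W_d} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \langle \Pi_V, C_{X_i} \rangle \right] \le \frac{1}{\sqrt{n}} \inf_{h \ge 0} \left\{ \sqrt{rh} + \sqrt{d \sum_{k \ge h+1} \lambda_k(K_2)} \right\}.$$

Proof. Use the same proof as in Corollary 2. □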



B Proofs 

Proof (of Theorem 1). Then $\Phi(X)$ is a random element of $\mathcal{H}$. By assumption,
each element $h \in \mathcal{H}$ can be identified with a measurable function $x \mapsto \langle h, \Phi(x) \rangle$.
Also, if $E[\|\Phi(X)\|] < \infty$, $\Phi(X)$ has an expectation which we denote by
$E[\Phi(X)] \in \mathcal{H}$. Consider the linear operator $T : \mathcal{H} \to L_2(P)$ defined as $(Th)(x) =
\langle h, \Phi(x) \rangle_{\mathcal{H}}$. By Cauchy-Schwarz inequality, $E \langle h, \Phi(X) \rangle^2 \le \|h\|^2 E\|\Phi(X)\|^2$. Thus,
$T$ is well-defined and continuous, thus it has a continuous adjoint $T^*$. Let
$f \in L_2(P)$, then $(E \|f(X)\Phi(X)\|)^2 \le \|f\|^2 E\|\Phi(X)\|^2$. So, the expectation of
$f(X)\Phi(X) \in \mathcal{H}$ can be defined. But for all $g \in \mathcal{H}$, $\langle T^*f, g \rangle_{\mathcal{H}} = \langle f, Tg \rangle_{L_2(P)} =
E[\langle g, f(X)\Phi(X) \rangle_{\mathcal{H}}]$ which shows that $T^*(f) = E[\Phi(X) f(X)]$.

We now prove that $C = T^*T$ and $K = TT^*$. By the definition of the
expectation, for all $h, h' \in \mathcal{H}$, $\langle h, T^*T(h') \rangle = \langle h, E[\Phi(X) \langle \Phi(X), h' \rangle] \rangle =
E[\langle h, \Phi(X) \rangle \langle h', \Phi(X) \rangle]$. Thus, by the uniqueness of a covariance operator, we
get $C = T^*T$. Similarly $(TT^*f)(x) = \langle T^*f, \Phi(x) \rangle = E[\langle f(X)\Phi(X), \Phi(x) \rangle] =
\int f(y) \langle \Phi(y), \Phi(x) \rangle \, dP(y)$ so that $K = TT^*$. By singular value decomposition,
it is easy to see that $\lambda(C) = \lambda(K)$ if $T$ is a compact operator. Actually, $T$
is Hilbert-Schmidt. Indeed, $\|T\|_{HS}^2 = \sum_{i \ge 1} \|Te_i\|^2 = \sum_{i \ge 1} E[\langle e_i, \Phi(X) \rangle^2] =
E[\|\Phi(X)\|^2]$. Hence, $T$ is compact, $C$ is trace-class ($\operatorname{tr} C = \|T\|_{HS}^2$) and since
$\operatorname{tr} TT^* = \operatorname{tr} T^*T$, $K$ is trace-class too. □
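The singular value argument used in this proof (the operators T*T and TT* share their non-zero spectrum and their trace) can be illustrated in finite dimensions. This is only a sketch: a random 8 x 3 matrix, named `T_op` here for the example, stands in for the operator T, which in the proof acts between infinite-dimensional spaces.

```python
import numpy as np

rng = np.random.default_rng(1)
T_op = rng.normal(size=(8, 3))   # finite-dimensional stand-in for T

C = T_op.T @ T_op                # plays the role of C = T*T (3 x 3)
K = T_op @ T_op.T                # plays the role of K = TT* (8 x 8)

eig_C = np.sort(np.linalg.eigvalsh(C))[::-1]
eig_K = np.sort(np.linalg.eigvalsh(K))[::-1]
sing = np.linalg.svd(T_op, compute_uv=False)   # singular values, descending

# Non-zero eigenvalues of C and K are the squared singular values of T_op;
# K simply has additional zero eigenvalues.
print(eig_C)
print(eig_K[:3])
print(sing ** 2)
```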

Proof (of Theorem 5). As in the proof of Theorem 4, we have to bound
$\sup_{V \in \mathcal{V}_d} (P_n - P) \langle \Pi_V, C_X \rangle$. We will use a slight modification of Theorem 3.3
of [2]. It is easy to see that applying Lemma 3.4 of [2] to the class of functions
$\{f' = -f;\ f \in \mathcal{F}\}$, with the assumption $T(f) \le -BPf$, one obtains (with the
notations of this lemma)



^ XBK ’ 

so that under the assumptions of Theorem 3.3, one can obtain the following 
version of the result 

Pf < + 

A -I- 1 B n 

which shows (for the initial class) that

$$P_n f \le \frac{K+1}{K} Pf + 704 (K+1) \frac{r^*}{B} + \frac{\xi \bigl( 11(b-a)(K+1)/K + 26B(K+1) \bigr)}{n} .$$

We apply this result to the class of functions $x \mapsto \langle \Pi_V, C_x \rangle$ for $V \in \mathcal{V}_d$, which
satisfies $P \langle \Pi_V, C_X \rangle^2 \le M P \langle \Pi_V, C_X \rangle$, and $\langle \Pi_V, C_x \rangle \in [0, M]$, and use Lemma
4. We obtain that for all $\alpha > 0$ and $\xi > 0$, with probability at least $1 - e^{-\xi}$,
every $V \in \mathcal{V}_d$ satisfies

$$P_n \langle \Pi_V, C_X \rangle \le (1 + \alpha) P \langle \Pi_V, C_X \rangle + 704 (1 + \alpha^{-1}) r_d^* + \frac{M\xi \bigl( 11(1+\alpha) + 26(1+\alpha^{-1}) \bigr)}{n},$$

where $r_d^* = r^*/M$ and $M \psi_d(r^*) = r^*$. $\psi_d(r)$ is the sub-root function that appeared
in Corollary 3. This concludes the proof. Inequality (5) is a simple consequence
of Bernstein’s inequality. □
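Proof (of Lemma 3). The first inequality is clear. For the second we start with

$$E\bigl[ \langle \Pi_{V^\perp} - \Pi_{V_d^\perp}, C_X \rangle^2 \bigr] = \langle \Pi_{V_d} - \Pi_V, C_2 (\Pi_{V_d} - \Pi_V) \rangle_{HS(\mathcal{H})}$$
$$\le \|C_2\| \, \|\Pi_{V_d} - \Pi_V\|^2_{HS(\mathcal{H})}$$
$$= 2 \|C_2\| \bigl( d - \langle \Pi_V, \Pi_{V_d} \rangle_{HS(\mathcal{H})} \bigr). \qquad (9)$$

By Lemma 1, $E \langle \Pi_{V^\perp} - \Pi_{V_d^\perp}, C_X \rangle = \langle \Pi_{V_d} - \Pi_V, C_1 \rangle$. We now introduce an
orthonormal basis $(f_i)_{i=1,\ldots,d}$ of $V$ and the orthonormal basis $(\phi_i)_{i=1,\ldots,d}$ of the
first $d$ eigenvectors of $C_1$. Moreover, we have

$$\langle \Pi_{V_d} - \Pi_V, C_1 \rangle = \sum_{i=1}^{d} \lambda_i(C_1) - \sum_{i=1}^{d} \langle f_i, C_1 f_i \rangle .$$

We decompose $f_i = \sum_{j=1}^{d} \langle f_i, \phi_j \rangle \phi_j + g_i$, where $g_i \in \operatorname{span}(\phi_1, \ldots, \phi_d)^\perp$ so that

$$\langle f_i, C_1 f_i \rangle = \sum_{j=1}^{d} \lambda_j(C_1) \langle f_i, \phi_j \rangle^2 + \langle g_i, C_1 g_i \rangle .$$

Theorem 3 implies $\langle g_i, C_1 g_i \rangle \le \lambda_{d+1}(C_1) \bigl( 1 - \sum_{j=1}^{d} \langle f_i, \phi_j \rangle^2 \bigr)$, hence we get

$$\langle \Pi_{V_d} - \Pi_V, C_1 \rangle \ge \sum_{i=1}^{d} \lambda_i(C_1) \Bigl( 1 - \sum_{j=1}^{d} \langle f_j, \phi_i \rangle^2 \Bigr) - \lambda_{d+1}(C_1) \Bigl( d - \sum_{i,j=1}^{d} \langle f_i, \phi_j \rangle^2 \Bigr) .$$

Using $1 - \sum_{j=1}^{d} \langle f_j, \phi_i \rangle^2 = 1 - \|\Pi_V(\phi_i)\|^2 \ge 0$ and the fact that the eigenvalues
of $C_1$ are in a non-increasing order we finally obtain

$$\langle \Pi_{V_d} - \Pi_V, C_1 \rangle \ge \bigl( \lambda_d(C_1) - \lambda_{d+1}(C_1) \bigr) \Bigl( d - \sum_{i,j=1}^{d} \langle f_i, \phi_j \rangle^2 \Bigr) . \qquad (10)$$

Also we notice that $\|C_2\| \le \|C_2\|_{HS(HS(\mathcal{H}_k))} = \|K_2\|_{HS(L_2(P))}$ (by
Lemma 1) and since $K_2$ is an integral operator with kernel $k^2(x, y)$,
$\|K_2\|^2_{HS(L_2(P))} = \int k^4(x, y) \, dP(x) \, dP(y)$. Now, Equation (1) gives
$\langle \Pi_V, \Pi_{V_d} \rangle_{HS(\mathcal{H})} = \sum_{i,j=1}^{d} \langle f_i, \phi_j \rangle^2$. Combining this with Inequalities (9)
and (10) we get the result. □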



Proof (of Theorem 7). We will apply Theorem 3.3 of [2] to the class of functions
$f_V : x \mapsto \langle \Pi_{V^\perp} - \Pi_{V_d^\perp}, C_x \rangle$ for $V \in \mathcal{V}_d$ and taking $V = V_n$ will give the result.
With the notations of [2], we set $T(f_V) = E[f_V(X)^2]$ and by Lemma 3 we
have $T(f_V) \le B_d E[f_V(X)]$. Also, $f_V(x) \in [-M, M]$. Moreover, we can upper
bound the localized Rademacher averages of the class $f_V$ using Corollary 2,
which combined with Lemma 4 gives the result. □
Abstract. We propose an algorithm for permuting or sorting multiple
sets (or bags) of objects such that they can ultimately be represented ef- 
ficiently using kernel principal component analysis. This framework gen- 
eralizes sorting from scalars to arbitrary inputs since all computations 
involve inner products which can be done in Hilbert space and kernel- 
ized. The cost function on the permutations or orderings emerges from a 
maximum likelihood Gaussian solution which approximately minimizes 
the volume data occupies in Hilbert space. This ensures that few kernel 
principal components are necessary to capture the variation of the sets 
or bags. Both global and almost-global iterative solutions are provided 
in terms of iterative algorithms by interleaving variational bounding (on 
quadratic assignment problems) with a Kuhn-Munkres algorithm (for 
solving linear assignment problems). 



1 Introduction 

Sorting or ordering a set of objects is a useful task in practical unsupervised 
learning as well as in general computation. For instance, we may have a set 
of unordered words describing an individual’s characteristics in paragraph form 
and we may wish to sort them in a consistent manner into fields such that the 
first field or word describes the individual’s eye color, the second word describes 
his profession, the third word describes his gender, and so forth. Alternatively, 
as in Figure 1, we may want to sort or order dot-drawings of face images such 
that the first dot is consistently the tip of the nose, the second dot is the left 
eye, the third dot is the right eye and so forth. However, finding a meaningful 
way to sort or order sets of objects is awkward when the objects are not scalars 
(scalars can always be sorted using, e.g. quick-sort). We instead propose sorting 
many bags or sets of objects such that the resulting sorted versions of the bags 
are easily representable using a small number of kernel principal components. 
In other words, we will find the sorting or ordering of many bags of objects 
such that the manifold formed by these sorted bags of objects will have low 
dimensionality. 

Fig. 1. Sorting or matching of 3 bags of 8 (x, y) coordinates representing faces.
(a) 3 unsorted dot images. (b) 3 sorted dot images.

In this article, we refer to sorting or ordering in the relative sense of the
word and seek the relative ordering between objects in two or more unordered
sets. This is equivalent to finding the correspondence between multiple sets of
objects. A classical incarnation of the correspondence task (also referred to as
matching, permutation or ordering between sets) is the so-called linear assignment
problem (LAP). A familiar example of LAP is in an auction or garage-sale
where N goods are available and N consumers each attribute a value to each
good. The solution to LAP is the best pairing of each consumer to a single
good such that the total value obtained is maximal. This is solvable using the
classical Kuhn-Munkres algorithm in $O(N^3)$ time. Kuhn-Munkres provides a
permutation matrix capturing the relative ordering between the two sets (goods
and consumers).
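The auction example can be made concrete with a tiny instance. The sketch below does not implement Kuhn-Munkres; as a stand-in it enumerates all N! pairings, which is only sensible for very small N. The value matrix and the function name are made up for the illustration.

```python
import itertools
import numpy as np

def solve_lap_brute_force(values):
    """Return (assignment, total) maximizing sum_i values[i, assignment[i]].

    Exhaustive search over all N! pairings; a stand-in for Kuhn-Munkres
    that is only usable for tiny N.
    """
    n = values.shape[0]
    best_perm, best_total = None, -np.inf
    for perm in itertools.permutations(range(n)):
        total = sum(values[i, perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return list(best_perm), float(best_total)

# values[i, j]: value that consumer i attributes to good j (made-up numbers)
values = np.array([[7, 9, 3],
                   [8, 6, 4],
                   [5, 7, 9]])
assignment, total = solve_lap_brute_force(values)
print(assignment, total)  # [1, 0, 2] 26.0
```

For realistic problem sizes a polynomial-time solver would replace the exhaustive search, for example `scipy.optimize.linear_sum_assignment`, which accepts `maximize=True`.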

Recent efficient variants of Kuhn-Munkres make it practical to apply to bags 
of thousands of objects [3]. Alternatively, relaxations of LAP have been pro- 
posed including the so-called invisible hand algorithm [8] . These tools have been 
used for finding correspondence and aligning images of, for instance, digits [2,14] 
to obtain better models (such as morphable or corresponded models). In fact, 
handling permutable or unordered sets is relevant for learning and image clas- 
sification as well. For example, permutable images and other objects have been 
handled via permutationally invariant kernels for support vector machine classi- 
fiers [7] or permutationally invariant expectation-maximization frameworks [6]. 
It is known that removing invariant aspects of input data (such as permutation) 
can improve a learning method [13]. Another approach is to explicitly estimate 
the ordering or permutation by minimizing the number of principal components 
needed to linearly model the variation of many sets or bags of objects [5,4]. 

In this paper, we build up a novel algorithm starting from the Kuhn-Munkres 
algorithm. Kuhn-Munkres sorts only a pair of bags or sets containing N vector- 
objects such that we minimize their squared norm. Our novel algorithm upgrades 
the search for an ordering from two bags to many simultaneous bags of objects by 
iterating the Kuhn-Munkres algorithm with variational bounds. The iterations 
either minimize the squared norm from all sorted bags to a common “mean bag” 
or minimize the dimensionality of the resulting manifold of sorted bags. These 
two criteria correspond to a generalization of the linear assignment problem and 
to the quadratic assignment problem, respectively. Both are handled via iterative 
solutions of the Kuhn-Munkres algorithm (or fast variants). We also kernelize 
the Kuhn-Munkres algorithm such that non-vectorial objects [11] can also be
ordered or sorted. 
2 Permuting Several Sets

Consider a dataset $\mathcal{D}$ of $T$ sets or bags, $\mathcal{D} = \{\chi_t\}_{t=1}^{T}$. Each of these bags is merely
a collection of $N$ unordered objects $\chi_t = \{\gamma_{t,n}\}_{n=1}^{N}$. We wish to find an ordering
for objects in these bags that makes sense according to some fairly general crite- 
rion. However, in the general case of bags over unusual objects (vectors, strings, 
graphs, etc.) it is not clear that a natural notion of ordering exists a priori. We 
will exploit kernels since they have been shown to handle a diverse range of input 
spaces. If our sorting algorithms leverage these by exclusively using generalized 
inner products within sorting computations we would be able to sort a variety of 
non-scalar objects. We therefore propose another criterion for sorting that finds 
orderings. The criterion is that the resulting ordered bags can be efficiently
encoded using principal components analysis (PCA) or kernel principal component
analysis (kPCA) [12]. Essentially, we want kPCA to capture the variation seen 
in the dataset with as few dimensions as possible. 

We will eventually deal with non-vectorial objects but for simplicity, we could
assume that all bags simply contain $N$ vectors of dimensionality $D$. Thus, we
assume each $\gamma_{t,n} \in \mathbb{R}^D$ and we can rewrite each bag $\chi_t$ in an $N \times D$ matrix
form as $X_t$. Our dataset of many bags can then be stored as $T$ matrices and
consists of $\{X_t\}_{t=1}^{T}$. To reorder each of these bags, we consider endowing each
matrix $X_t$ with an unknown $N \times N$ permutation matrix $A_t$ which re-sorts its
$N$ row entries. Therefore, we augment our dataset with matrices that re-sort it
as follows $\{A_t X_t\}_{t=1}^{T}$. In the more general case where we are not dealing with
vectors for each $\gamma_{t,n}$, we will take the permutation matrices $A_t$ to be a general
permutation $p_t$ of the set $\{1, \ldots, N\}$ which defines an ordering of the bag as
follows $p_t \otimes \chi_t = \{\gamma_{t,p_t(1)}, \ldots, \gamma_{t,p_t(N)}\}$. This gives us an ordered version of the dataset
for a specific configuration of orderings denoted $P$ which we write as follows
$\mathcal{D}_P = \{p_t \otimes \chi_t\}_{t=1}^{T}$.
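The two views of a reordering, a permutation matrix acting on the rows of a bag and an index permutation, can be compared on a small example. This is a sketch with made-up numbers; the convention that slot i receives object `p_t[i]` is an assumption of the example.

```python
import numpy as np

X_t = np.array([[0.0, 0.0],     # a bag of N = 3 points in D = 2 dimensions
                [1.0, 2.0],
                [3.0, 1.0]])
p_t = np.array([2, 0, 1])       # ordering: slot i receives object p_t[i]

A_t = np.eye(3)[p_t]            # N x N permutation matrix, A_t[i, p_t[i]] = 1
sorted_by_matrix = A_t @ X_t    # re-sort the rows with the matrix
sorted_by_index = X_t[p_t]      # same re-sorting with plain indexing
print(sorted_by_matrix)
```

The index form is the one that carries over to non-vectorial objects, where no matrix product is available.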

Given the original dataset, we want to find a good permutation configuration 
by optimizing the matrices $\{A_t\}_{t=1}^{T}$ or the permutation configurations $\{p_t\}_{t=1}^{T}$.
To make the notion of goodness of permutation configurations concrete, we will
argue that good permutations will reveal a compact low-dimensional representation
of the data. For instance, the data may lie on a low dimensional manifold
that is much smaller than the embedding space of size $ND$ or $N|\gamma_{t,n}|$, where
$|\gamma_{t,n}|$ is the dimensionality of the objects being permuted (if and when such a
quantity makes sense). We now elaborate how to approximately measure the 
dimensionality of the potentially nonlinear manifold spanning the data. This is 
done by observing the eigenvalue spectrum of kernel PCA which approximates 
the volume data occupies in Hilbert space. Clearly, a low volume suggests that 
we are dealing with a low dimensional manifold in Hilbert space. 



2.1 Kernel PCA and Gaussians in Hilbert Space 

We subscribe to the perspective that PCA finds a subspace from data by modeling
it as a degenerate Gaussian since only first and second order statistics of
a dataset $\{x_t\}_{t=1}^{T}$ are computed [7]. Similarly, kernel PCA finds a subspace in
Hilbert space only looking at first and second order statistics of the feature
vectors $\{\phi(x_t)\}_{t=1}^{T}$ instead^1. In fact, we are also restricted to second order
statistics since we wish to use kernel methods and can thus only interact with data
in Hilbert space via inner-products $k(x_t, x_{t'}) = \langle \phi(x_t), \phi(x_{t'}) \rangle$.

One way to evaluate the quality of a subspace discovered by kernel PCA is
by estimating the volume occupied by the data. In cases where the volume of
the data in Hilbert space is low, we anticipate that only a few kernel principal
components will be necessary to span and reconstruct the dataset. Since kernel
PCA hinges on Gaussian statistics, we will only use a second order estimator of
the volume of our dataset. Consider computing the mean and covariance of a
Gaussian from the dataset in Hilbert space. In kernel PCA [12], recall that the
top eigenvalues of the covariance matrix $\Sigma = \frac{1}{T} \sum_{t=1}^{T} \phi(x_t) \phi(x_t)^\top$ of the data are
related to the top eigenvalues of the $T \times T$ Gram matrix $K$ of the data which is
defined element-wise as $[K]_{t,t'} = k(x_t, x_{t'})$. The eigenvalues $\lambda$ and eigenvectors
$\alpha$ of the Gram matrix are given by the solution to the problem:


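$$\begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_T) \\ \vdots & \ddots & \vdots \\ k(x_T, x_1) & \cdots & k(x_T, x_T) \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_T \end{bmatrix} = T \lambda \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_T \end{bmatrix} .$$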

From the above, we find the top $J$ eigenvectors which produce the highest $J$
eigenvalues and approximate the dataset with a $J$-dimensional nonlinear manifold.
The eigenfunctions $v^j(x)$ of the covariance matrix describe axes of variation
on the manifold and are unit-norm functions approximated by:

$$v^j(x) \propto \sum_{t=1}^{T} \alpha_t^j \, k(x, x_t) .$$

These are normalized such that $\langle v^j, v^j \rangle = 1$. The spectrum of eigenvalues
describes the overall shape of a Gaussian model of the data in Hilbert space while
the eigenvectors of the covariance matrix capture the Gaussian’s orientation.
The volume of the data can then be approximated by the determinant of the
covariance matrix which equals the product of its eigenvalues $\lambda^j$:

$$\text{Volume} \approx |\Sigma| = \prod_{j} \lambda^j .$$
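A minimal sketch of these computations follows, under assumptions that are not taken from the paper: an RBF base kernel, no centering of the Gram matrix, and a small constant `eps` added to each eigenvalue so that the log-volume stays finite when eigenvalues vanish. All function names are hypothetical.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix of the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_pca(K, J):
    """Solve K alpha = T lambda alpha and keep the top J components.

    Coefficients are rescaled so that each eigenfunction
    v^j(x) = sum_t alpha_t^j k(x, x_t) has unit norm (alpha^T K alpha = 1).
    Assumes the top J eigenvalues of K are strictly positive.
    """
    T = K.shape[0]
    w, A = np.linalg.eigh(K)                  # ascending eigenvalues of K
    order = np.argsort(w)[::-1][:J]
    lam = w[order] / T                        # eigenvalues lambda^j
    alpha = A[:, order] / np.sqrt(w[order])   # unit-norm eigenfunctions
    return lam, alpha

def log_volume(K, eps=1e-3):
    """Regularized log-volume: sum_j log(lambda^j + eps) over the spectrum."""
    T = K.shape[0]
    lam = np.clip(np.linalg.eigvalsh(K) / T, 0.0, None)
    return float(np.sum(np.log(lam + eps)))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = rbf_gram(X, gamma=0.5)
lam, alpha = kernel_pca(K, J=3)
print(lam, log_volume(K))
```

Working with the logarithm of the product of eigenvalues avoids numerical underflow when many eigenvalues are small.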

If we are dealing with a truly low-dimensional subspace, only a few eigenvalues
(corresponding to eigenvectors spanning the manifold) will be large. The many
remaining eigenvalues corresponding to noise off of the manifold will be small and
the volume we ultimately estimate by multiplying all these eigenvalues will be
low^2. Thus, a kernel PCA manifold that is low-dimensional should typically have
low volume. It is well known that kernel PCA can also be (implicitly) centered
by estimating and removing the mean of the data yet we will not elaborate this
straightforward issue (refer instead to [12]). Before applying PCA, recall that we
perform maximum likelihood estimation to obtain the mean $\mu$ and the covariance
$\Sigma$. The volume of the dataset is related to its log-likelihood under the maximum
likelihood estimate of a Gaussian model as shown in [4]:

$$l(\mu, \Sigma) = \sum_{t} \log \mathcal{N}(x_t \mid \mu, \Sigma) = -\frac{TD}{2} \log(2\pi) - \frac{T}{2} \log|\Sigma| - \frac{1}{2} \sum_{t} (x_t - \mu)^\top \Sigma^{-1} (x_t - \mu) .$$

^1 While this Hilbert space could potentially be infinite dimensional and Gaussians
and kernel PCA should be handled more formally (i.e. using Gaussian processes
with white noise and appropriate operators), in this paper and for our purposes we
will assume we are manipulating only finite-dimensional Hilbert spaces. Formalising
the extensions to infinite Hilbert space is straightforward.



Log-likelihood simplifies as follows when we use the maximum likelihood setting
for the mean $\mu = \frac{1}{T} \sum_{t} x_t$ and covariance $\Sigma = \frac{1}{T} \sum_{t} (x_t - \mu)(x_t - \mu)^\top$:

$$l = -\frac{TD}{2} \log(2\pi) - \frac{T}{2} \log|\Sigma| - \frac{TD}{2} .$$

Therefore, we can see that a kernel PCA solution which has high log-likelihood
according to the Gaussian mean and covariance will also have low volume (low
$\log|\Sigma|$) and produce a compact low-dimensional manifold requiring few principal
axes to span the data.
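The simplification of the Gaussian log-likelihood at the maximum likelihood estimates can be verified numerically in a finite-dimensional setting. The sketch below assumes ordinary vectors with T > D so that the estimated covariance is invertible; the function name is hypothetical.

```python
import numpy as np

def gaussian_loglik_at_ml(X):
    """Return the Gaussian log-likelihood of the rows of X at the ML mean and
    covariance, computed directly and via the closed form
    -(TD/2) log(2 pi) - (T/2) log|Sigma| - TD/2."""
    T, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / T
    _, logdet = np.linalg.slogdet(Sigma)
    # sum_t (x_t - mu)^T Sigma^{-1} (x_t - mu); equals T * D at the ML estimate
    quad = np.einsum("ti,ij,tj->", Xc, np.linalg.inv(Sigma), Xc)
    const = -0.5 * T * D * np.log(2.0 * np.pi)
    direct = const - 0.5 * T * logdet - 0.5 * quad
    closed_form = const - 0.5 * T * logdet - 0.5 * T * D
    return float(direct), float(closed_form)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
direct, closed_form = gaussian_loglik_at_ml(X)
print(direct, closed_form)
```

Since everything except the log-determinant term is constant in the data, a higher likelihood at the maximum likelihood estimate is the same thing as a smaller log-volume.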



2.2 Permutations That Maximize Likelihood and Minimize Volume 

We saw that we are solving a maximum likelihood problem to perform kernel
PCA and higher likelihoods indicate lower volume and a better subspace. However,
the above formulation assumes we have vectors or can readily compute
kernels or inner products between kPCA’s $T$ Hilbert-space vectors $\{\phi(x_t)\}_{t=1}^{T}$.
This is not trivial when each $x_t$ is actually an unordered bag of tuples as we had
when we were previously dealing with $\chi_t$. However, given an ordering of each
via $A_t$ matrices or $p_t$ permutations, we can consider computing a kernel on the
sorted bags as follows:

$$k(p_t \otimes \chi_t, p_{t'} \otimes \chi_{t'}) = \sum_{i=1}^{N} \kappa(\gamma_{t,p_t(i)}, \gamma_{t',p_{t'}(i)})$$

assuming we have defined a base kernel $\kappa(\cdot, \cdot)$ between the actual objects $\gamma_{t,n}$
in our bags. Another potentially clearer view of the above is to instead assume
we have bags of Hilbert-space vectors where our dataset $\mathcal{D}$ has $T$ of these sets
or bags $\mathcal{D} = \{\Phi_t\}_{t=1}^{T}$. Each of these bags is merely a collection of $N$ unordered
objects in Hilbert space $\Phi_t = \{\phi(\gamma_{t,n})\}_{n=1}^{N}$. Applying the ordering $p_t$ to this
unordered bag of Hilbert space vectors provides an ordered set as follows $p_t \otimes \Phi_t =
\{\phi(\gamma_{t,p_t(n)})\}_{n=1}^{N}$. Inner products between two ordered bags are again given in
terms of the base kernel $\kappa(\cdot, \cdot)$ as follows:

$$\langle p_t \otimes \Phi_t, p_{t'} \otimes \Phi_{t'} \rangle = \sum_{i=1}^{N} \langle \phi(\gamma_{t,p_t(i)}), \phi(\gamma_{t',p_{t'}(i)}) \rangle = \sum_{i=1}^{N} \kappa(\gamma_{t,p_t(i)}, \gamma_{t',p_{t'}(i)}) .$$

^2 Here we are assuming that we do not obtain any zero-valued eigenvalues which
produce a degenerate estimate of volume. We will regularize eigenvalues in the
subsequent sections to avoid this problem.
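The kernel between two ordered bags, as a sum of base-kernel values between the objects that share a slot, can be sketched as follows. The function names, the linear base kernel and the toy bags are assumptions made for the example; any positive definite base kernel on the objects could be passed in instead.

```python
import numpy as np

def ordered_bag_kernel(bag_a, perm_a, bag_b, perm_b, base_kernel):
    """Sum of base-kernel values between objects placed in the same slot."""
    return sum(base_kernel(bag_a[i], bag_b[j]) for i, j in zip(perm_a, perm_b))

def linear_kernel(x, y):
    """Base kernel used for the example: the ordinary dot product."""
    return float(np.dot(x, y))

bag_a = np.array([[1.0, 0.0], [0.0, 2.0]])
bag_b = np.array([[0.0, 1.0], [3.0, 0.0]])

same_order = ordered_bag_kernel(bag_a, [0, 1], bag_b, [0, 1], linear_kernel)
swapped = ordered_bag_kernel(bag_a, [0, 1], bag_b, [1, 0], linear_kernel)
print(same_order, swapped)  # 0.0 5.0
```

The value depends on the chosen orderings, which is exactly what makes the permutations a parameter that can be optimized.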

As in [4] we will find settings of $A_t$ or $p_t$ that maximize likelihood under a
Gaussian model to minimize volume. However, instead of directly minimizing the
volume by assuming we always have updated the mean and covariance with their
maximum likelihood setting, we will treat the problem as an iterative likelihood
maximization scheme. We have the following log-likelihood problem which we
argued measures the volume of the data at the maximum likelihood estimate of
$\mu$ and $\Sigma$:

$$l(p_1, \ldots, p_T, \mu, \Sigma) = \sum_{t} l_t(p_t, \mu, \Sigma) = \sum_{t} \log \mathcal{N}(p_t \otimes \Phi_t \mid \mu, \Sigma) .$$

Further increasing likelihood by adjusting $p_1, \ldots, p_T$ will also further decrease
volume as we interleave updates of $\mu$ and $\Sigma$. Thus, the above is an objective
function on permutations and maximizing it should produce an ordering of our
bags that keeps kernel PCA efficient. Here, we are assuming we have a Gaussian
in Hilbert space yet it is not immediately clear how to maximize or evaluate
the above objective function and obtain permutation configurations that give
low-volume kernel PCA manifolds. We will next elaborate this and show that
all computations are straightforward to perform in Hilbert space.

We will maximize likelihood over $p_1, \ldots, p_T$, $\mu$ and $\Sigma$ iteratively in an
axis-parallel manner. This is done by locking all parameters of the log-likelihood and
modifying a single one at a time. Note, first, that it is straightforward, given a
current setting of $(p_1, \ldots, p_T)$, to compute the maximum likelihood $\mu$ and $\Sigma$ as
the mean and covariance in Hilbert space. Now, assume we have locked $\mu$ and
$\Sigma$ at a current setting and we wish to only increase likelihood by adjusting the
permutation $p_t$ of a single bag $\Phi_t$. We investigate two separate cases. In the first
case, we assume the covariance matrix $\Sigma$ is locked at a scalar times identity and
we find the optimal update for a given $p_t$ by solving a linear assignment problem.
We will then consider the more general case where the current covariance
matrix $\Sigma$ in Hilbert space is an arbitrary positive semi-definite matrix and updating
the current $p_t$ will involve solving a quadratic assignment problem.



3 Kernelized Sorting Via LAP and Mean Alignment 

Given $\mu$, $p_1, \ldots, p_T$ and $\Sigma = \sigma I$ we wish to find a
setting of $p_t$ which maximizes the likelihood of an isotropic Gaussian. This
clearly involves only maximizing the following contribution of bag $t$ to the
total log-likelihood:

$$l_t(p_t, \mu, \Sigma) = \log \mathcal{N}(p_t \otimes \chi_t \mid \mu, \sigma I).$$

We can simplify the above as follows:

$$l_t(p_t, \mu, \Sigma) = \mathrm{const} - \frac{1}{2\sigma} \left( \langle p_t \otimes \chi_t, p_t \otimes \chi_t \rangle - 2 \langle p_t \otimes \chi_t, \mu \rangle + \langle \mu, \mu \rangle \right).$$

Since $\langle p_t \otimes \chi_t, p_t \otimes \chi_t \rangle$ is constant
regardless of our choice of $p_t$, maximizing the above over $p_t$ is equivalent
to minimizing the following cost function:

$$\hat{p}_t = \arg\min_{p_t} \; - \langle p_t \otimes \chi_t, \mu \rangle.$$

Assume we have the current maximum likelihood mean, which is computed from the
locked permutation configurations from the previous iteration
$p_1, \ldots, p_T$. The above then simplifies into:

$$\hat{p}_t = \arg\min_{p_t} \; - \left\langle p_t \otimes \chi_t, \frac{1}{T} \sum_{t'=1}^{T} p_{t'} \otimes \chi_{t'} \right\rangle = \arg\min_{p_t} \; - \sum_{i=1}^{N} \sum_{t'=1}^{T} \kappa(\gamma_{t,p_t(i)}, \gamma_{t',p_{t'}(i)}).$$

The above problem is an instance of the linear assignment problem (LAP) and
can be solved directly, producing the optimal $p_t$ in $O(N^3)$ via the
Kuhn-Munkres algorithm (or more efficient variants such as QuickMatch [10],
auction algorithms or the cost scaling algorithm). Essentially, we find the
permutation matrix $A_t$ which is analogous to $p_t$ by solving the assignment
problem on the $N \times N$ matrix $D_t$ via a simple call to the (standard)
function KuhnMunkres($-D_t$), where $D_t$ is an $N \times N$ matrix giving the
value of kernel evaluations between items in the current bag and the mean bag.
We define the $D_t$ matrix element-wise as:

$$[D_t]_{i,i'} = \frac{1}{T} \sum_{t'=1}^{T} \kappa(\gamma_{t,i}, \gamma_{t',p_{t'}(i')}).$$
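As a rough illustration of this update, the following Python sketch builds the matrix of mean kernel scores between the items of one bag and the slots of the already ordered bags, then solves the resulting assignment problem. It is a sketch under stated assumptions, not the paper's implementation: the names `hungarian_min`, `rbf` and `lap_update`, the toy scalar bags and the RBF base kernel are hypothetical choices made for this example, and a textbook $O(N^3)$ Hungarian routine stands in for whichever Kuhn-Munkres variant is preferred.

```python
import math


def hungarian_min(cost):
    """Solve a square assignment problem; assign[i] is the column given to row i."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials (1-based)
    v = [0.0] * (n + 1)          # column potentials (1-based)
    p = [0] * (n + 1)            # p[j] = row currently matched to column j
    way = [0] * (n + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while True:               # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assign = [0] * n
    for j in range(1, n + 1):
        assign[p[j] - 1] = j - 1
    return assign


def rbf(a, b):
    """Hypothetical base kernel on scalars, used only for this example."""
    return math.exp(-(a - b) ** 2)


def lap_update(bag, ordered_bags, kernel):
    """Reorder `bag` so that it best aligns with the mean of `ordered_bags`."""
    n, num_bags = len(bag), len(ordered_bags)
    # D[i][s]: mean kernel score between item i and slot s of the ordered bags.
    D = [[sum(kernel(bag[i], other[s]) for other in ordered_bags) / num_bags
          for s in range(n)] for i in range(n)]
    # Maximizing the total score is the same as minimizing the cost -D.
    slot_of_item = hungarian_min([[-d for d in row] for row in D])
    ordered = [None] * n
    for i, s in enumerate(slot_of_item):
        ordered[s] = bag[i]
    return ordered
```

Calling `lap_update` once per bag, and recomputing the scores after each call, corresponds to one sweep of the iterative scheme; any other exact assignment solver could replace `hungarian_min`.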

Iterating the update of each $p_t$ in this way for $t = 1 \ldots T$ and
updating the mean $\mu$ repeatedly by its maximum likelihood estimate will
converge to a maximum of the log-likelihood. While a formal proof is deferred in
this paper, this maximum may actually be global since the above problem is
analogous to the generalized Procrustes problem [1]. In the general Procrustes
setting, we can mimic the problem of aligning or permuting many bags towards a
common mean by instead computing the alignments or permutations between all
possible pairs of bags. For instance, it is possible to find permutations
$p_{t,t'}$ or matrices $A_{t,t'}$ that align each bag $\chi_t$ to any other bag
$\chi_{t'}$ via $[D_{t,t'}]_{i,i'} = \kappa(\gamma_{t,i}, \gamma_{t',i'})$.
These then give a consistent set of permutations to align the data towards a
common mean prior to kernel PCA. This provides us with the ordering
$p_1, \ldots, p_T$ of the data, which now becomes a dataset of ordered bags
$\{p_t \otimes \chi_t\}_{t=1}^{T}$. Subsequently, we perform kernel PCA on the
data in $O(T^3)$ using singular value decomposition on the $T \times T$ centered
Gram matrix. This gives the eigenvectors, eigenvalues and eigenfunctions that
span the nonlinear manifold representation of the ordered data. This will have a
higher likelihood and potentially use fewer principal components to achieve the
same reconstruction accuracy as immediate application of kernel PCA on the
dataset $\mathcal{D}$. Of course, this argument only holds if the dataset itself
truly has a natural permutation invariance or was a collection of sets or bags.

We now turn to the more general case where the Gaussian covariance is 
arbitrary and is not artificially locked at a spherical configuration. However, in 
this setting, global convergence claims are even more elusive. 

4 Kernelized Sorting Via QAP and Covariance Alignment 

In the case where we consider anisotropic Gaussians, the covariance matrix is
an arbitrary positive semi-definite matrix and we have a more involved procedure
for updating a given $p_t$. However, this is more closely matched to the full
problem of minimizing the volume of the data and should produce more valuable
orderings that further reduce the number of kernel principal components we need
to represent the ordered bags. Here, we are updating a single $p_t$ again yet
the covariance matrix $\Sigma$ is not a scaled identity. We therefore have the
following contribution of bag $t$ to the log-likelihood objective function:

$$l_t(p_t, \mu, \Sigma) = \log \mathcal{N}(p_t \otimes \chi_t \mid \mu, \Sigma).$$

Due to the presence of the $\Sigma$, this will no longer reduce to a simple
linear assignment problem that is directly solvable for $A_t$ or $p_t$ using a
polynomial time algorithm. In fact, this objective will produce an NP-complete
quadratic assignment problem [9]. Instead we will describe an iterative
technique for maximizing the likelihood over $p_t$ by using a variational upper
bound on the objective function.

Define the inverse matrix $M = \Sigma^{-1}$, which we will assume has actually
been regularized using two small scalars $\epsilon_1$ and $\epsilon_2$ (the
intuition for this regularization is given in [5]). Recall kernel PCA (with
abuse of notation) gives the matrix $\Sigma$ as follows:
$\Sigma = \sum_j \bar\lambda_j v^j (v^j)^T$, where $\bar\lambda_j$ are the
eigenvalues of $\Sigma$. Meanwhile, the matrix $M$ can also be expressed, with
abuse of notation, in terms of its eigenvalues $\lambda_j$ and eigenfunctions
$v^j$ for $j = 1, \ldots, J$ as follows:
$M = \sum_{j=1}^{J} \lambda_j v^j (v^j)^T + \sigma I$. We can assume we pick a
finite $J$ that is sufficiently large to have a faithful approximation to $M$.
Recall that, as in kernel PCA, the (unnormalized) eigenfunctions are given by
the previous estimate of the inverse covariance at the previous (locked)
estimates of the permutations $p_1, \ldots, p_T$:

$$v^j = \sum_{t=1}^{T} \alpha_t^j \, p_t \otimes \chi_t$$

where the normalization such that $\langle v^j, v^j \rangle = 1$ is absorbed
into the $\lambda_j$ for brevity.

We can now rewrite the (slightly regularized) log-likelihood more succinctly by
noting that $\mu$ and $\Sigma$ are locked (thus some terms become mere
constants):

$$l_t(p_t) = \mathrm{const} - \frac{1}{2} (p_t \otimes \chi_t - \mu)^T M (p_t \otimes \chi_t - \mu)$$

$$= \mathrm{const} - \frac{1}{2} (p_t \otimes \chi_t)^T M (p_t \otimes \chi_t) + (p_t \otimes \chi_t)^T M \mu$$

$$= \mathrm{const} - \frac{1}{2} (p_t \otimes \chi_t)^T \sum_{j=1}^{J} \lambda_j v^j (v^j)^T (p_t \otimes \chi_t) + (p_t \otimes \chi_t)^T M \mu$$

where we have used the expanded definition of the $M$ matrix, yet its isotropic
contribution $\sigma I$ as before has no effect on the quadratic term involving
$p_t$. However, the anisotropic contribution remains and we have a QAP problem
which we continue simplifying by writing the eigenvectors as linear combinations
of Hilbert space vectors or kernel functions:

$$l_t(p_t) = \mathrm{const} - \frac{1}{2} \sum_{j=1}^{J} \lambda_j \left( \sum_{m=1}^{T} \alpha_m^j \langle p_t \otimes \chi_t, p_m \otimes \chi_m \rangle \right)^2 + \sum_{j=1}^{J} \lambda_j \sum_{m=1}^{T} \alpha_m^j \langle p_t \otimes \chi_t, p_m \otimes \chi_m \rangle \sum_{n=1}^{T} \alpha_n^j \langle \mu, p_n \otimes \chi_n \rangle + \sigma \langle p_t \otimes \chi_t, \mu \rangle.$$

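For notational convenience, exchange the $p_t$ notation and start using the
permutation matrix notation $A_t$ by noting the following relationship:

$$\langle p_t \otimes \chi_t, p_{t'} \otimes \chi_{t'} \rangle = \sum_{i=1}^{N} \sum_{i'=1}^{N} [A_t]_{i,i'} \, \kappa(\gamma_{t,i}, \gamma_{t',p_{t'}(i')}).$$

We can now rewrite the (negated) log-likelihood term as a cost function
$C(A_t) = -l(A_t)$ over the space of permutation matrices $A_t$. This cost
function is as follows after we drop some trivial constant terms:

$$C(A_t) = \sum_{j=1}^{J} \frac{\lambda_j}{2} \left( \sum_{i=1}^{N} \sum_{i'=1}^{N} [A_t]_{i,i'} \sum_{m=1}^{T} \alpha_m^j \kappa(\gamma_{t,i}, \gamma_{m,p_m(i')}) \right)^2 - \sum_{i=1}^{N} \sum_{i'=1}^{N} [A_t]_{i,i'} [D_t]_{i,i'}$$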

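where we have defined the readily computable $N \times N$ matrix $D_t$
element-wise as follows for brevity:

$$[D_t]_{i,i'} = \sum_{j=1}^{J} \lambda_j \sum_{m=1}^{T} \alpha_m^j \kappa(\gamma_{t,i}, \gamma_{m,p_m(i')}) \left( \sum_{n=1}^{T} \alpha_n^j \langle \mu, p_n \otimes \chi_n \rangle \right) + \frac{\sigma}{T} \sum_{t'=1}^{T} \kappa(\gamma_{t,i}, \gamma_{t',p_{t'}(i')}).$$

This matrix degenerates to the previous isotropic case if all anisotropic
Lagrange multipliers go to zero, leaving only the $\sigma I$ contribution. Note,
we can fill in the terms in the parentheses as follows:

$$\langle \mu, p_n \otimes \chi_n \rangle = \frac{1}{T} \sum_{t'=1}^{T} \langle p_{t'} \otimes \chi_{t'}, p_n \otimes \chi_n \rangle = \frac{1}{T} \sum_{t'=1}^{T} \sum_{i=1}^{N} \kappa(\gamma_{n,p_n(i)}, \gamma_{t',p_{t'}(i)})$$

which lets us numerically compute the $D_t$ matrix's $N \times N$ entries.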

Clearly the first term in $C(A_t)$ is quadratic in the permutation matrix $A_t$
while the second term in $C(A_t)$ is linear in the permutation matrix.
Therefore, the second LAP term could be optimized using a Kuhn-Munkres
algorithm; however, the full cost function is a quadratic assignment problem. To
address this issue, we will upper bound the first quadratic cost term with a
linear term such that we can minimize $C(A_t)$ iteratively using repeated
applications of Kuhn-Munkres. This approach to solving QAP iteratively via
bounding and LAP is similar in spirit to the well-known Gilmore-Lawler bound
method as well as other techniques in the literature [9].

First, we construct an upper bound on the cost by introducing two $J \times N$
matrices called $Q$ and $\tilde{Q}$. The entries of both $Q$ and $\tilde{Q}$ are
non-negative and have the property that summing across their columns gives unity
as follows:

$$\sum_{i} [Q]_{j,i} = 1 \quad \text{and} \quad \sum_{i'} [\tilde{Q}]_{j,i'} = 1.$$

i i' 



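We insert the ratio of a convex combination of these two matrices (weighted by
a positive scalar $\delta^j \in [0, 1]$) into our cost such that

$$C(A_t) = \sum_{j=1}^{J} \frac{\lambda_j}{2} \left( \sum_{i=1}^{N} \sum_{i'=1}^{N} [A_t]_{i,i'} \, \frac{\delta^j [Q]_{j,i} + (1 - \delta^j) [\tilde{Q}]_{j,i'}}{\delta^j [Q]_{j,i} + (1 - \delta^j) [\tilde{Q}]_{j,i'}} \sum_{m=1}^{T} \alpha_m^j \kappa(\gamma_{t,i}, \gamma_{m,p_m(i')}) \right)^2 - \sum_{i=1}^{N} \sum_{i'=1}^{N} [A_t]_{i,i'} [D_t]_{i,i'}.$$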

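Note that this in no way changes the cost function; we are merely multiplying
each entry of the matrix $A_t$ by unity. Next recall that the squaring function
$f(x) = x^2$ is convex and we can therefore apply Jensen's inequality to pull
terms out of it. We first recognize that we have a convex combination within the
squaring since:

$$\sum_{i=1}^{N} \sum_{i'=1}^{N} [A_t]_{i,i'} \left( \delta^j [Q]_{j,i} + (1 - \delta^j) [\tilde{Q}]_{j,i'} \right) = 1 \quad \forall j.$$

Therefore, we can proceed with Jensen to obtain the upper bound on cost as
follows:

$$C(A_t) \le \sum_{j=1}^{J} \frac{\lambda_j}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} [A_t]_{i,i'} \left( \delta^j [Q]_{j,i} + (1 - \delta^j) [\tilde{Q}]_{j,i'} \right) \left( \frac{\sum_{m=1}^{T} \alpha_m^j \kappa(\gamma_{t,i}, \gamma_{m,p_m(i')})}{\delta^j [Q]_{j,i} + (1 - \delta^j) [\tilde{Q}]_{j,i'}} \right)^2 - \sum_{i=1}^{N} \sum_{i'=1}^{N} [A_t]_{i,i'} [D_t]_{i,i'}.$$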



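The above bound is actually just a linear assignment problem (LAP) which we
write succinctly as follows:

$$C(A_t) \le \sum_{i=1}^{N} \sum_{i'=1}^{N} [A_t]_{i,i'} \left[ \sum_{j=1}^{J} \frac{\lambda_j \left( \sum_{m=1}^{T} \alpha_m^j \kappa(\gamma_{t,i}, \gamma_{m,p_m(i')}) \right)^2}{2 \left( \delta^j [Q]_{j,i} + (1 - \delta^j) [\tilde{Q}]_{j,i'} \right)} - [D_t]_{i,i'} \right].$$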
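The Jensen step can be checked numerically for a single eigen-direction $j$. The Python sketch below compares the quadratic term $\frac{\lambda}{2}(\sum_{i,i'} [A]_{i,i'} K_{i,i'})^2$ with its linearized upper bound $\frac{\lambda}{2}\sum_{i,i'} [A]_{i,i'} K_{i,i'}^2 / (\delta q_i + (1-\delta)\tilde q_{i'})$ on random inputs. Everything here is illustrative: `K` stands for the precomputed values $\sum_m \alpha_m^j \kappa(\gamma_{t,i}, \gamma_{m,p_m(i')})$, `q` and `q_tilde` for one row of each auxiliary matrix, and all function names are hypothetical.

```python
import random


def permutation_matrix(perm):
    """0/1 matrix with a single one in row i at column perm[i]."""
    n = len(perm)
    return [[1 if perm[i] == k else 0 for k in range(n)] for i in range(n)]


def random_simplex(n, rng):
    """Strictly positive weights that sum to one."""
    raw = [rng.uniform(0.1, 1.0) for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def quadratic_term(A, K, lam):
    """(lam / 2) * (sum_{i,k} A[i][k] * K[i][k]) ** 2."""
    n = len(A)
    s = sum(A[i][k] * K[i][k] for i in range(n) for k in range(n))
    return 0.5 * lam * s * s


def jensen_bound(A, K, lam, q, q_tilde, delta):
    """Linear-in-A upper bound on quadratic_term obtained from Jensen."""
    n = len(A)
    total = 0.0
    for i in range(n):
        for k in range(n):
            if A[i][k]:
                weight = delta * q[i] + (1.0 - delta) * q_tilde[k]
                total += K[i][k] ** 2 / weight
    return 0.5 * lam * total


if __name__ == "__main__":
    rng = random.Random(0)
    perm = [2, 0, 1]
    A = permutation_matrix(perm)
    K = [[rng.uniform(-2.0, 2.0) for _ in range(3)] for _ in range(3)]
    q, q_tilde = random_simplex(3, rng), random_simplex(3, rng)
    print(quadratic_term(A, K, 1.0), jensen_bound(A, K, 1.0, q, q_tilde, 0.5))
```

Because the bound is linear in the entries of the permutation matrix, it can be handed to an assignment solver, which is what makes the iterative scheme possible.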



The above upper bound can immediately be minimized over permutation matrices
and gives $A_t$ via a Kuhn-Munkres computation or some variant. However, we
would need to actually specify $Q$, $\tilde{Q}$ and all the $\delta^j$ for this
computation. In fact, the right hand side is a variational LAP bound over our
original QAP with the (augmented) parameters $Q$, $\tilde{Q}$,
$\delta = (\delta^1, \ldots, \delta^J)$ and $A_t$, which can each be iteratively
minimized. Thus, we anticipate repeatedly minimizing over $A_t$ using
Kuhn-Munkres operations followed by updates of the remaining bound parameters
given a current setting of $A_t$. Note, the left term in the square bracket is
constant if all eigenvalues $\lambda_j$ are equal (in which case the
log-likelihood term overall is merely an LAP). Thus, we can see that the
variance in the eigenvalues is likely to have some effect as we depart from a
pure LAP setting to a more severe QAP setting. This variance in eigenvalue
spectrum can give us some indication about the convergence of the iterative
procedure.

We next minimize the bound on the right hand side over $Q$ and $\tilde{Q}$,
which is written more succinctly as follows:

$$\min_{Q} \min_{\tilde{Q}} \sum_{i=1}^{N} \sum_{i'=1}^{N} \sum_{j=1}^{J} \frac{[P^j]_{i,i'}}{\delta^j [Q]_{j,i} + (1 - \delta^j) [\tilde{Q}]_{j,i'}}$$

where we have defined each matrix $P^j$ element-wise using the formula at the
current setting of $A_t$:

$$[P^j]_{i,i'} = [A_t]_{i,i'} \, \frac{\lambda_j}{2} \left( \sum_{m=1}^{T} \alpha_m^j \kappa(\gamma_{t,i}, \gamma_{m,p_m(i')}) \right)^2.$$

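This is still not directly solvable as is. Therefore we consider another
variational bounding step (which leads to more iterations) by applying Jensen on
the convex function $f(x) = 1/x$ (this is true only when $x$ is positive, which
is the case here). This produces the following inequality:

$$\sum_{i=1}^{N} \sum_{i'=1}^{N} \frac{[P^j]_{i,i'}}{\delta^j [Q]_{j,i} + (1 - \delta^j) [\tilde{Q}]_{j,i'}} \le \sum_{i=1}^{N} \sum_{i'=1}^{N} [P^j]_{i,i'} \left( \frac{\delta^j}{[Q]_{j,i}} + \frac{1 - \delta^j}{[\tilde{Q}]_{j,i'}} \right).$$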



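Clearly, once we have invoked the second application of Jensen's inequality on
this function, we get an easy update rule for $Q$ by taking derivatives and
setting them to zero. In addition, we introduce the Lagrangian constraint that
enforces the summation to unity, $\sum_i [Q]_{j,i} = 1$. Ultimately, we obtain
this update rule:

$$[Q]_{j,i} = \frac{\sqrt{\sum_{i'} [P^j]_{i,i'}}}{\sum_{k} \sqrt{\sum_{i'} [P^j]_{k,i'}}}.$$

Similarly, $\tilde{Q}$ is updated as follows:

$$[\tilde{Q}]_{j,i'} = \frac{\sqrt{\sum_{i} [P^j]_{i,i'}}}{\sum_{k} \sqrt{\sum_{i} [P^j]_{i,k}}}.$$

The remaining update rule for the $\delta^j$ values is then given as follows:

$$\min_{\delta} \sum_{i=1}^{N} \sum_{i'=1}^{N} \sum_{j=1}^{J} \frac{[P^j]_{i,i'}}{\delta^j [Q]_{j,i} + (1 - \delta^j) [\tilde{Q}]_{j,i'}}.$$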










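The terms for each single $\delta^j$ are independent and yield the following:

$$\min_{\delta^j} \sum_{i=1}^{N} \sum_{i'=1}^{N} \frac{[P^j]_{i,i'}}{\delta^j [Q]_{j,i} + (1 - \delta^j) [\tilde{Q}]_{j,i'}}.$$

One straightforward manner to minimize the above extremely simple cost over a
scalar $\delta^j \in [0, 1]$ is to use brute force techniques or
bisection/Brent's search.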
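The auxiliary updates for one eigen-direction can be sketched as follows. The sketch assumes a non-negative $N \times N$ matrix `P` (the $[P^j]_{i,i'}$ of the text), takes the normalized square roots of its row and column sums as the closed-form minimizers of the relaxed bound, and then searches over the scalar weight. A ternary search is used here purely for simplicity, in place of bisection or Brent's method; it is valid because the objective is convex in the weight. The function names and the small `eps` floor that guards against zero entries are assumptions of this example.

```python
import math


def update_q(P, eps=1e-12):
    """Return (q, q_tilde): normalized square roots of row and column sums of P."""
    n = len(P)
    row = [math.sqrt(sum(P[i][k] for k in range(n))) + eps for i in range(n)]
    col = [math.sqrt(sum(P[i][k] for i in range(n))) + eps for k in range(n)]
    row_total, col_total = sum(row), sum(col)
    return [r / row_total for r in row], [c / col_total for c in col]


def bound_value(P, q, q_tilde, delta):
    """sum_{i,k} P[i][k] / (delta * q[i] + (1 - delta) * q_tilde[k])."""
    n = len(P)
    return sum(P[i][k] / (delta * q[i] + (1.0 - delta) * q_tilde[k])
               for i in range(n) for k in range(n) if P[i][k] > 0.0)


def minimize_delta(P, q, q_tilde, iters=100):
    """Ternary search for the minimizing weight in [0, 1] (objective is convex)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if bound_value(P, q, q_tilde, m1) <= bound_value(P, q, q_tilde, m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```

In a full implementation these two steps would alternate with the assignment update of the permutation matrix until the bound stops decreasing.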

Thus, we can iterate updates of $Q$, $\tilde{Q}$, and the $\delta^j$ with
updates of $A_t$ to iteratively minimize the upper bound on $C(A_t)$ and
maximize likelihood. Updating $A_t$ is straightforward via a Kuhn-Munkres
algorithm (or faster heuristic algorithms such as QuickMatch [10]) on the terms
in the square bracket multiplying the entries of the $A_t$ matrix (in other
words, iterate a linear assignment problem, LAP). Convergence of this iterative
scheme is reasonable and improves the likelihood as we update $A_t$. But, it may
have local minima¹. We are working on even tighter bounds that seem promising
and should further improve convergence and alleviate the local minima problem.
Once the iterative scheme converges for a given bag $\chi_t$, we obtain the
$A_t$ matrix which directly gives the permutation configuration $p_t$.

We continue updating the $p_t$ for each bag in our data set while also updating
the mean and the covariance (or, equivalently, the eigenvalues, eigenvectors and
eigenfunctions for kernel PCA). This iteratively maximizes the log-likelihood
(and minimizes the volume of the data) until we reach a local maximum and
converge to a final ordering of our dataset of bags
$\{p_t \otimes \chi_t\}_{t=1}^{T}$.



5 Implementation Details 



We now discuss some particular implementation details of applying the method
in practice. First, we are not bound to assuming that there must be exactly $N$
objects in each bag. Assume we are given $t = 1 \ldots T$ bags with a variable
number $N_t$ of objects in each bag. We first pick a constant $N$ (typically
$N = \max_t N_t$) and then randomly replicate (or sample without replacement for
small $N$) the objects in each bag such that each bag has $N$ objects. Another
consideration is that we generally hold the permutation of one bag fixed since
permutations are relative. Therefore, the permutation $p_1$ for bag $\chi_1$ is
locked (i.e. for a permutation matrix we would set $A_1 = I$) and only the
remaining permutations need to be optimized. We then iterate through the data,
randomly updating one $p_t$ at a time from the permutations $p_2, \ldots, p_T$.
We first start by using the mean estimator (LAP) and update its estimate for
each $p_t$ until it no longer reduces the volume (as measured by the regularized
product of kPCA's eigenvalues). We then iterate the update rule for the
covariance QAP estimator until it no longer reduces volume. Finally, once
converged, we perform kernel PCA on the sorted bags with the final setting of
$p_2, \ldots, p_T$.
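The padding step for bags of unequal size might look like the following Python sketch. It covers only the replication case with $N$ set to the largest bag size; the function name `pad_bags` and the use of a seeded `random.Random` for reproducibility are assumptions of this example, and bags are assumed to be non-empty.

```python
import random


def pad_bags(bags, rng=None):
    """Replicate randomly chosen items so that every bag has max_t N_t items."""
    rng = rng or random.Random(0)
    target = max(len(bag) for bag in bags)
    padded = []
    for bag in bags:
        # Draw the missing items from the bag itself (random replication).
        extra = [rng.choice(bag) for _ in range(target - len(bag))]
        padded.append(list(bag) + extra)
    return padded
```

The original items keep their positions and the replicated ones are appended, so the first bag can still be held fixed as the reference ordering.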

¹ This is not surprising since QAP is NP-complete.







6 Experiments 

In a preliminary experiment, we obtained a dataset of $T = 100$ digits of 9's
and 3's as shown in Figure 2(a). Each digit is actually a bag or a set of
$N = 70$ total $(x, y)$ coordinates which form our
$\gamma_{t,n} \in \mathbb{R}^2$. We computed the optimal permutations $p_t$ for
each digit using the minimum volume criterion (i.e. maximum likelihood with the
anisotropic Gaussian case). Figure 2(b) shows the eigenvalue spectrum for PCA
before ordering (i.e. assuming the given pseudo-random ordering in the raw input
dataset) as well as the eigenvalue spectrum after optimizing the ordering. Note
that lower eigenvalues indicate a smaller subspace and that there are few true
dimensions of variability in the data once we sort the bags.

(a) Bags of points dataset (b) Eigenvalue spectrum




Fig. 2. Ordering digits as bags of permutable point-clouds prior to PCA. In (a) we 
see a sample of the original training set of 100 digits while in (b) we see the original 
PCA eigenvalue spectrum (darker bars) with the initial pseudo-random ordering in the 
data. In (b) we see the eigenvalue spectrum (lighter bars) after optimizing the ordering 
to minimize the volume of the subspace (or maximize likelihood under an anisotropic 
Gaussian). In (c), note the increasing log-likelihood as we optimize each $p_t$.



To visualize the resulting orderings, we computed linear interpolations be- 
tween the sorted bags for different pairs of digits in the input dataset. Figure 3 
depicts the morphing as we mix the coordinates of each dot in each digit with an- 
other. Note in (a), these ’bags of coordinates’ are unordered. Therefore, blending 
their coordinates results in a meaningless cloud of points during the transition. 
However, in (b), we note that the points in each bag or cloud are corresponded 
and ordered so morphing or linearly interpolating their coordinates for two dif- 
ferent digits results in a meaningful smooth movement and bending of the digit. 
Note that in (b) morphs from 3 to another 3, 9 to another 9 or a 3 to a 9 main- 
tain meaningful structure at the half-way point as we blend between one digit 
and another. This indicates a more meaningful ordering has emerged unlike the 
initial random one which, when blending between two digit shapes, always
generates a random cloud of $(x, y)$ coordinates (see Figure 3(a)). For this
dataset, results were similar for the mean vs. covariance estimator as well as
linear vs. quadratic choices for the base kernel $\kappa(\cdot, \cdot)$.

(a) Morphing unsorted digits 



(b) Morphing sorted digits 



(c) Flow 



Fig. 3. Linear interpolation from left to right (morphing) of the point-clouds with and 
without sorting. In (a) we see the linear morphing between unordered point clouds 
which results in poor intermediate morphs that are not meaningful. Meanwhile in (b) 
where we have recovered good orderings $p_t$ for each digit by minimizing the Gaussian’s 
volume, we note that the digits preserve the correspondence between different parts 
and induce a smooth and natural morph between the two initial digit configurations. 
In (c) we show the two digits with arrows indicating the flow or correspondence. 



7 Conclusions 

We have proposed an algorithm for finding orderings or sortings of multiple sets 
of objects. These sets or bags need not contain scalars or vectors but rather 
contain N arbitrary objects. Interacting with these objects is done solely via 
kernel functions on pairs of them leading to a general notion of sorting in Hilbert 
space. The ordering or sorting we propose is such that we form a low-dimensional 
kernel PCA approximation with as few eigenfunctions as possible to reconstruct 
the manifold on which these bags exist. This is done by finding the permutations 
of the bags such that we move them towards a common mean in Hilbert space 
or a low-volume Gaussian configuration in Hilbert space. In this article, this 
criterion suggested two maximum likelihood objective functions: one which is 
a linear assignment problem and the other a quadratic assignment problem. 
Both can be iteratively minimized by using a Kuhn Munkres algorithm along 
with variational bounding. This permits us to sort or order sets in a general 
way in Hilbert space using kernel methods and to ultimately obtain a compact 
representation of the data. We are currently investigating ambitious applications 
of the method with various kernels, and additional results are available at: 

http://www.cs.columbia.edu/~jebara/bags/ 

In future work, we plan on investigating discriminative variations of the sort- 
ing/ordering problem to build classifiers based on support vector machines or 
kernelized Fisher discriminants that sort data prior to classification (see [4] which 
elaborates a quadratic cost function for the Fisher discriminant). 

Abstract. We consider the problem of labeling a partially labeled 
graph. This setting may arise in a number of situations from survey sam- 
pling to information retrieval to pattern recognition in manifold settings. 
It is also of potential practical importance, when the data is abundant, 
but labeling is expensive or requires human assistance. 

Our approach develops a framework for regularization on such graphs. 
The algorithms are very simple and involve solving a single, usually 
sparse, system of linear equations. Using the notion of algorithmic sta- 
bility, we derive bounds on the generalization error and relate it to struc- 
tural invariants of the graph. Some experimental results testing the per- 
formance of the regularization algorithm and the usefulness of the gen- 
eralization bound are presented. 



1 Introduction 

In pattern recognition problems, there is a probability distribution P according 
to which labeled and possibly unlabeled examples are drawn and presented to 
a learner. This P is usually far from uniform and therefore might have some 
non-trivial geometric structure. We are interested in the design and analysis of 
learning algorithms that exploit this geometric structure. For example, P may 
have support on or close to a manifold. In a discrete setting, it may have support 
on a graph. In this paper we consider the problem of predicting the labels on 
vertices of a partially labeled graph. Our goal is to design algorithms that are 
adapted to the structure of the graph. Our analysis shows that the generalization 
ability of such algorithms is controlled by geometric invariants of the graph. 

Consider a weighted graph $G = (V, E)$ where $V = \{x_1, \ldots, x_n\}$ is the
vertex set and $E$ is the edge set. Associated with each edge $e_{ij} \in E$ is
a weight $W_{ij}$. If there is no edge present between $x_i$ and $x_j$,
$W_{ij} = 0$. Imagine a situation where a subset of these vertices are labeled
with values $y_i \in \mathbb{R}$. We wish to 
predict the values of the rest of the vertices. In doing so, we would like to exploit 
the structure of the graph. In particular, in our approach we will assume that 
the weights are indications of the affinity of nodes with respect to each other 
and consequently are related to the potential similarity of the y values these 
nodes are likely to have. Ultimately we propose an algorithm for regularization 
on graphs. 

This general problem arises in a number of different settings. For example, in 
survey sampling, one has a database of individuals along with their preference 
profiles that determines a graph structure based on similarity of preferences. 
One wishes to estimate a survey variable (e.g. hours of TV watched, amount of 
cheese consumed, etc.). Rather than survey the entire set of individuals every 
time, which might be impractical, one may sample a subset of the individuals 
and then attempt to infer the survey variable for the rest of the individuals. 
In Internet and information retrieval applications, one is often in possession of 
a database of objects that have a natural graph structure (or more generally 
affinity matrix). One may wish to categorize the objects into various classes but 
only a few (object, class) pairs may be obtained by access to a supervised or- 
acle. In the Finite Element Method for solving PDEs, one sometimes evaluates 
the solution at some of the points of the finite element mesh and one needs to 
estimate the value of the solution at all other points. A final example arises 
when data is obtained by sampling an underlying manifold embedded in a high 
dimensional space. In recent approaches to dimensionality reduction, clustering 
and classification in this setting, a graph approximation to the underlying man- 
ifold is computed. Semi-supervised learning in this manifold setting reduces to 
a partially labeled classification problem of the graph. This last example is an 
instantiation of transductive learning where other approaches include the Naive 
Bayes for text classification in [12], transductive SVM [15,9], the graph mincut 
approach in [2], and the random walk on the adjacency graph in [14]. We also 
note the closely related work [11], which uses kernels and in particular diffusion 
kernels on graphs for classification. 

In the manifold setting the graph is easily seen to be an empirical object. It 
is worthwhile to note that in all applications of interest, even those unrelated to 
the manifold setting, the graph reflects pairwise relationships on the data, and 
hence is an empirical object whenever the data consists of random samples. 

We consider this problem in some generality and introduce a framework for 
regularization on graphs. Two algorithms are derived within this framework. The 
resulting optima have simple analytical expressions. If the graph is sparse, the 
algorithms are fast and, in particular, do not require the computation of multiple 
eigenvectors as is common in many spectral methods (including our previous 
approach [1]). Another advantage of the current framework is that it is possible 
to provide theoretical guarantees for generalization error. Using techniques from 
algorithmic stability we show that generalization error is bounded in terms of 
the smallest nontrivial eigenvalue (Fiedler number) of the graph. Interestingly, 
it suggests that generalization performance depends on the geometry of the 
graph rather than on its size. Finally some experimental evaluation is conducted 
suggesting that this approach to partially labeled classification is competitive. 

Several groups of researchers have been investigating related ideas. In partic- 
ular, [13] also proposed algorithms for graph regularization. In [17] the authors 
propose the Label Propagation algorithm for semi-supervised learning, which 
is similar to our Interpolated Regularization when S = L. In [16] a somewhat 
different regularizer together with the normalized Laplacian is used for 
semi-supervised learning. The ideas of spectral clustering motivated the authors of [4] 
to introduce Cluster Kernels for semi-supervised learning. The authors suggest 
explicitly manipulating eigenvalues of the kernel matrix. We also note closely 
related work on metric labeling [10]. 

2 Regression on Graphs 

2.1 Regularization and Regression on Graphs 

To approximate a function on a graph $G$ with the weight matrix $W_{ij}$, we
need a notion of a “good” function. One way to think about such a function is
that it does not make too many “jumps”. We formalize that notion (see also our
earlier paper [1]) by the smoothness functional

$$\mathcal{S}(\mathbf{f}) = \sum_{i \sim j} W_{ij} (f_i - f_j)^2,$$

where the sum is taken over the adjacent vertices of $G$. For “good” functions
$f$ the functional $\mathcal{S}$ takes small values.

It is important to observe that

$$\sum_{i \sim j} W_{ij} (f_i - f_j)^2 = \mathbf{f}^T L \mathbf{f},$$

where $L$ is the Laplacian $L = D - W$,
$D = \mathrm{diag}(\sum_i W_{1i}, \ldots, \sum_i W_{ni})$. This is a basic
identity in spectral graph theory and provides some intuition for the remarkable
properties of the graph Laplacian $L$.
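The identity between the edge-wise smoothness sum and the Laplacian quadratic form is easy to verify numerically. The following Python sketch does so for a small weighted path graph; the function names and the example weights are chosen only for illustration, and the sum over adjacent vertices is taken once per undirected edge.

```python
def laplacian(W):
    """L = D - W, with D the diagonal matrix of row sums of W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def smoothness(W, f):
    """Sum over undirected edges {i, j} of W[i][j] * (f[i] - f[j]) ** 2."""
    n = len(W)
    return sum(W[i][j] * (f[i] - f[j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def quadratic_form(L, f):
    """f^T L f."""
    n = len(L)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))


# Path graph 0 - 1 - 2 with edge weights 1 and 2.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 2.0, 0.0]]
f = [1.0, 2.0, 4.0]
# Both expressions equal 1 * (1 - 2)^2 + 2 * (2 - 4)^2 = 9.
print(smoothness(W, f), quadratic_form(laplacian(W), f))
```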

Other smoothness matrices, such as $L^p$, $p \in \mathbb{N}$, or $\exp(-tL)$,
$t \in \mathbb{R}$, are also possible. In particular, $L^2$ often seems to work
well in practice.



2.2 Algorithms for Regression on Graphs 

Let G = {V, E) be a graph with n vertices and the weight matrix Wij. For the 
purposes of this paper we will assume that G is connected and that the vertices 
of the graph are numbered. We would like to regress a function $f : V \to \mathbb{R}$. $f$ 
is defined on vertices of $G$, however we have only partial information, say for 
the first $k$ vertices. That is $f(x_i) = y_i$, $1 \le i \le k$. The labels can potentially 
be noisy. We also allow data points to have multiplicities, i.e. each vertex of the 
graph may appear more than once with same or different y value. 

We precondition the data by mean subtracting first. That is we take

$$\tilde{\mathbf{y}} = (y_1 - \bar{y}, \ldots, y_k - \bar{y})$$

where $\bar{y} = \frac{1}{k} \sum_i y_i$. This is needed for stability of the
algorithms as will be seen in the theoretical discussion.

Algorithm 1: Tikhonov regularization (parameter $\gamma \in \mathbb{R}$). The
objective is to minimize the square loss function plus the smoothness penalty.

$$\tilde{\mathbf{f}} = \arg\min_{\mathbf{f} = (f_1, \ldots, f_n),\ \sum_i f_i = 0} \; \frac{1}{k} \sum_i (f_i - \tilde{y}_i)^2 + \gamma \, \mathbf{f}^T S \mathbf{f}$$

$S$ here is a smoothness matrix, e.g. $S = L$ or $S = L^p$, $p \in \mathbb{N}$.
The condition $\sum_i f_i = 0$ is needed to make the algorithm stable. It can be
seen by following the proof of Theorem 1 that the necessary stability and the
corresponding generalization bound cannot be obtained unless the regularization
problem is constrained to functions with mean 0.

Without loss of generality we can assume that the first $l$ points on the
graph are labeled. $l$ might be different from the number of sample points $k$,
since we allow vertices to have different labels (or the same label several
times).

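The solution to the quadratic problem above is not hard to obtain by standard
linear algebra considerations. If we denote by $\mathbf{1} = (1, 1, \ldots, 1)$
the vector of all ones, the solution can be given in the form

$$\tilde{\mathbf{f}} = (k \gamma S + I_k)^{-1} (\tilde{\mathbf{y}} + \mu \mathbf{1})$$

Here $\tilde{\mathbf{y}}$ is the $n$-vector
$\tilde{\mathbf{y}} = (\sum_j y_{1j}, \sum_j y_{2j}, \ldots, \sum_j y_{lj}, 0, \ldots, 0)$,
where we sum the labels corresponding to the same vertex on the graph.

$I_k$ is a diagonal matrix of multiplicities

$$I_k = \mathrm{diag}(n_1, n_2, \ldots, n_l, 0, \ldots, 0)$$

where $n_i$ is the number of occurrences of vertex $i$ among the labeled points
in the sample. $\mu$ is chosen so that the resulting vector $\mathbf{f}$ is
orthogonal to $\mathbf{1}$. Denote by $s(\mathbf{f})$ the functional

$$s : \mathbf{f} \to \sum_i f_i$$

Since $s$ is linear, we obtain
$0 = s(\tilde{\mathbf{f}}) = s((k \gamma S + I_k)^{-1} \tilde{\mathbf{y}}) + \mu \, s((k \gamma S + I_k)^{-1} \mathbf{1})$.
Therefore we can write

$$\mu = - \frac{s((k \gamma S + I_k)^{-1} \tilde{\mathbf{y}})}{s((k \gamma S + I_k)^{-1} \mathbf{1})}$$

Note that dropping the condition $\mathbf{f} \perp \mathbf{1}$ is equivalent to
putting $\mu = 0$.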
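A small Python sketch of the closed form $(k\gamma S + I_k)^{-1}(\tilde{\mathbf{y}} + \mu\mathbf{1})$ is given below. It is meant as an illustration under stated assumptions rather than a reference implementation: the linear solver is a plain dense Gaussian elimination (a sparse solver would be used on large graphs), labels are passed as hypothetical `(vertex_index, value)` pairs so that repeated vertices are allowed, and the function names are invented for this example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def tikhonov(S, labels, gamma):
    """Return (f, y_bar): zero-mean solution on all vertices and the label mean."""
    n, k = len(S), len(labels)
    y_bar = sum(y for _, y in labels) / k
    y_vec = [0.0] * n            # summed, mean-subtracted labels per vertex
    counts = [0.0] * n           # multiplicities n_i on the diagonal of I_k
    for vertex, y in labels:
        y_vec[vertex] += y - y_bar
        counts[vertex] += 1.0
    A = [[k * gamma * S[i][j] + (counts[i] if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    a = solve(A, y_vec)          # A^{-1} y
    b = solve(A, [1.0] * n)      # A^{-1} 1
    mu = -sum(a) / sum(b)        # enforces sum_i f_i = 0
    return [ai + mu * bi for ai, bi in zip(a, b)], y_bar
```

Predictions on the original label scale are obtained by adding `y_bar` back to the returned vector.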

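Algorithm 2: Interpolated Regularization (no parameters).

Here we assume that the values $y_1, \ldots, y_k$ have no noise. Thus the
optimization problem is to find a function of maximum smoothness satisfying
$f(x_i) = \tilde{y}_i$, $1 \le i \le k$:

$$\tilde{\mathbf{f}} = \arg\min_{\mathbf{f} = (\tilde{y}_1, \ldots, \tilde{y}_k, f_{k+1}, \ldots, f_n),\ \sum_i f_i = 0} \; \mathbf{f}^T S \mathbf{f}$$

As before $S$ is a smoothness matrix, e.g. $L$ or $L^p$. However, here we are
not allowing multiple vertices in the sample. We partition $S$ as

$$S = \begin{pmatrix} S_1 & S_2 \\ S_2^T & S_3 \end{pmatrix}$$

where $S_1$ is a $k \times k$ matrix, $S_2$ is $k \times (n - k)$ and $S_3$ is
$(n - k) \times (n - k)$. Let $\tilde{\mathbf{f}}$ be the values of $f$ where
the function is unknown, $\tilde{\mathbf{f}} = (f_{k+1}, \ldots, f_n)$.
Straightforward linear algebra yields the solution (the leading minus sign is
the one for which $\mu = 0$ gives the minimizer of $\mathbf{f}^T S \mathbf{f}$
with the labeled values held fixed):

$$\tilde{\mathbf{f}} = - S_3^{-1} S_2^T \left( (\tilde{y}_1, \ldots, \tilde{y}_k)^T + \mu \mathbf{1} \right)$$

$$\mu = - \frac{s(S_3^{-1} S_2^T \tilde{\mathbf{y}})}{s(S_3^{-1} S_2^T \mathbf{1})}$$

The regression formula is very simple and has no free parameters. However,
the quality of the results depends on whether $S_3$ is well conditioned.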
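The interpolation step can be sketched in Python as follows, assuming the labeled vertices come first in the ordering of $S$ and each is labeled exactly once. The sketch solves with the unlabeled block $S_3$, applies it to the labels shifted by a constant $\mu$, and picks $\mu$ so that the unlabeled values sum to zero; the sign is chosen so that $\mu = 0$ gives the energy-minimizing (harmonic) extension of the labels. The function names and the dense Gaussian elimination are assumptions of this example, not part of the original description.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def interpolated(S, y):
    """Return (values on unlabeled vertices, label mean); labeled vertices come first."""
    n, k = len(S), len(y)
    y_bar = sum(y) / k
    y_tilde = [v - y_bar for v in y]
    S3 = [row[k:] for row in S[k:]]       # unlabeled-unlabeled block
    S2T = [row[:k] for row in S[k:]]      # unlabeled-labeled block (S_2 transposed)

    def apply(vec):
        """Compute S3^{-1} S2^T vec."""
        rhs = [sum(S2T[r][c] * vec[c] for c in range(k)) for r in range(n - k)]
        return solve(S3, rhs)

    a = apply(y_tilde)
    b = apply([1.0] * k)
    mu = -sum(a) / sum(b)                 # makes the unlabeled values sum to zero
    return [-(ai + mu * bi) for ai, bi in zip(a, b)], y_bar
```

On a connected graph with $S = L$ the denominator used for $\mu$ is nonzero; for other smoothness matrices that would have to be checked separately.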

It can be shown that Interpolated Regularization is the limit case of Tikhonov
regularization when $\gamma$ tends to 0. That is, given a function $f$, and
denoting by $\mathcal{R}eg_\gamma$ and $\mathcal{R}eg_{int}$ Tikhonov
regularization and Interpolated regularization, respectively, we have

$$\lim_{\gamma \to 0} \mathcal{R}eg_\gamma(f) = \mathcal{R}eg_{int}(f)$$

That correspondence suggests using the condition $f \perp \mathbf{1}$ for
interpolated regularization as well, even though no stability-based bounds are
available in that case.

It is interesting to note that this condition, imposed for purely theoretical
reasons, seems similar to the class mass normalization step in [17].



3 Theoretical Analysis 

In this section we investigate some theoretical guarantees for the generalization 
error of regularization on graphs. We use the notion of algorithmic stability, 
first introduced by Devroye and Wagner in [6] and later used by Bousquet and 
Elisseeff in [3] to prove generalization bounds for regularization networks. 

The goal of a learning algorithm is to learn a function on some space $V$ from
examples. Given a set of examples $T$ the learning algorithm produces a function
$f_T : V \to \mathbb{R}$. Therefore a learning rule is a map from data sets into
functions on $V$. We will be interested in the case where $V$ is a graph.

The empirical risk $R_k(f)$ (with the square loss function) is a measure of how
well we do on the training set:

$$R_k(f) = \frac{1}{k} \sum_{i=1}^{k} (f(x_i) - y_i)^2$$

The generalization error $R(f)$ is the expectation of how well we do on all
points, labeled or unlabeled.

$$R(f) = E_\mu (f(x) - y(x))^2$$

where the expectation is taken over an underlying distribution $\mu$ on
$V \times \mathbb{R}$ according to which the labeled examples are drawn.

As before denote the smallest nontrivial eigenvalue of the smoothness matrix
$S$ by $\lambda_1$. If $S$ is the Laplacian of the graph, this value, first
introduced by Fiedler in [7] as algebraic connectivity and sometimes known as
the Fiedler constant, plays a key role in spectral graph theory. One
interpretation of $\lambda_1$ is that it gives an estimate of how well $V$ can
be partitioned. We expect $\lambda_1$ to be relatively large, say
$\lambda_1 > O(\frac{1}{n^r})$, $0 < r \ll 1$. For example for an
$n$-dimensional hypercube $\lambda_1 = 2$. If $\lambda_1$ is very small, a
sensible possibility would be to cut the graph in two, using the eigenvector
corresponding to $\lambda_1$, and proceed with regularization separately for the
two parts.

The theorem below states that as long as $k$ is large and the values of the 
solution to the regularization problem are bounded, we get good generalization 
results. We note that the constant K can be bounded using the properties of 
the graph. See the propositions below for the details. We did not make these 
estimates a part of the Theorem 1 as it would make the formulas even more 
cumbersome. 

Theorem 1 (Generalization Performance of Graph Regularization).

Let $\gamma$ be the regularization parameter, $T$ be a set of $k \ge 4$ vertices
$x_1, \ldots, x_k$, where each vertex occurs no more than $t$ times, together
with values $y_1, \ldots, y_k$, $|y_i| \le M$. Let $f_T$ be the regularization
solution using the smoothness functional $S$ with the second smallest eigenvalue
$\lambda_1$. Assuming that $\forall x \; |f_T(x)| \le K$ we have with
probability $1 - \delta$ (conditional on the multiplicity being no greater than
$t$):

$$|R_k(f_T) - R(f_T)| \le \beta + \sqrt{\frac{2 \log(2/\delta)}{k}} \left( k \beta + (K + M)^2 \right)$$

where

$$\beta = \frac{3 M \sqrt{t k}}{(k \gamma \lambda_1 - t)^2} + \frac{4 M}{k \gamma \lambda_1 - t}$$

Proof. The theorem is obtained by rewriting the formula in Theorem 4 in
terms of $\delta$ and then applying Theorem 5.

We see that, as usual in estimates of the generalization error, it decreases
at a rate $\frac{1}{\sqrt{k}}$. It is important to note that the estimate is
nearly independent of the total number of vertices $n$ in the graph. We say
“nearly” since the probability of having multiple points increases as $k$
becomes close to $n$ and since the value of $\lambda_1$ may (or may not)
implicitly depend on the number of vertices.
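To get a feel for the size of this bound, the following is a minimal sketch that
evaluates it numerically. It assumes the bound has the form
$\beta + \sqrt{2\log(2/\delta)/k}\,(k\beta + (K+M)^2)$ with
$\beta = 3M\sqrt{tk}/(k\gamma\lambda_1 - t)^2 + 4M/(k\gamma\lambda_1 - t)$ and a
natural logarithm. The function names, the defaults $M = 1$ and $t = 1$, and the
choice of plugging in $K = M/\sqrt{\lambda_1\gamma}$ are illustrative
assumptions, not settings stated in the text.

```python
import math


def stability_beta(k, gamma, lam1, M=1.0, t=1):
    """beta = 3M*sqrt(tk)/(k*gamma*lam1 - t)^2 + 4M/(k*gamma*lam1 - t)."""
    denom = k * gamma * lam1 - t
    if denom <= 0:
        # The stability result assumes this denominator is positive.
        raise ValueError("k * gamma * lam1 - t must be positive")
    return 3.0 * M * math.sqrt(t * k) / denom ** 2 + 4.0 * M / denom


def sup_norm_bound(gamma, lam1, M=1.0):
    """Estimate K <= M / sqrt(lam1 * gamma) for the largest entry of the solution."""
    return M / math.sqrt(lam1 * gamma)


def generalization_bound(k, gamma, lam1, delta, M=1.0, t=1, K=None):
    """Right-hand side of the bound on |R_k(f_T) - R(f_T)| at confidence 1 - delta."""
    if K is None:
        K = sup_norm_bound(gamma, lam1, M)  # assumption: use the general-S estimate
    beta = stability_beta(k, gamma, lam1, M, t)
    slack = math.sqrt(2.0 * math.log(2.0 / delta) / k)
    return beta + slack * (k * beta + (K + M) ** 2)


if __name__ == "__main__":
    # lam1 and delta taken from the Ionosphere experiment reported later.
    for k in (10, 100, 300):
        print(k, round(generalization_bound(k, gamma=1.0, lam1=34.9907, delta=0.1), 2))
```

Under these assumed settings the values for $\gamma = 1$ come out close to the
corresponding Ionosphere bound entries reported in the experiments, which
suggests (but does not prove) that the experiments used similar constants.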

The only thing that is missing is an estimate for $K$. Below we give two such
estimates, one for the case of general $S$ and the other, possibly sharper, when
the smoothness matrix is the Laplacian $S = L$.

Proposition 1. With $\lambda_1$, $M$ and $\gamma$ as above we have the following
inequality:

$$\|f\|_\infty \le \frac{M}{\sqrt{\lambda_1 \gamma}}$$

Proof. Let's first denote the quantity we are trying to minimize by $P(f)$:

$$P(f) = \frac{1}{k} \sum_i (f_i - y_i)^2 + \gamma f^T S f$$

The first observation we make is that when $f = 0$,
$P(f) = \frac{1}{k} \sum_i y_i^2 \le M^2$. Thus, if $f$ minimizes $P(f)$, we have
$0 \le \gamma f^T S f \le M^2$. Recall that $f \in H$, where $H$ is the linear
space of vectors with mean 0 and that the smallest eigenvalue of $S$ restricted
to $H$ is $\lambda_1$. Therefore, recalling that $\|f\|_2 \ge \|f\|_\infty$, we
obtain

$$f^T S f \ge \lambda_1 \|f\|_2^2 \ge \lambda_1 \|f\|_\infty^2$$

Thus

$$\|f\|_\infty \le \frac{M}{\sqrt{\lambda_1 \gamma}}$$



A different inequality can be obtained when $S = L$. Note that the diameter
of the graph is typically far smaller than the number of vertices. For example,
when $G$ is an $n$-cube, the number of vertices is $2^n$, while the diameter is
$n$.

Proposition 2. Let $w = \min_{i \sim j} w_{ij}$ be the smallest nonzero weight
of the graph $G$. Assume $G$ is connected. Let $D$ be the unweighted diameter of
the graph, i.e. the maximum length of the shortest path between two points on
the graph. Then the maximum entry $K$ of the solution to the
$\gamma$-regularization problem with $y$'s bounded by $M$ satisfies the
following inequality:

$$K \le M \sqrt{\frac{D}{\gamma w}}$$

A useful special case is

Corollary 2. If all weights of $G$ are either 0 or 1, then

$$K \le M \sqrt{\frac{D}{\gamma}}$$

Proof. Using the same notation as above, we see by substituting the 0 vector
that if $f$ minimizes $P(f)$, then $P(f) \le M^2$.

Let $K$ be the biggest entry of $f$ with the corresponding vertex $v_1$. Take
any vertex $v_2$ for which there is a $y \le 0$. Such a vertex exists, since the
data has mean 0. Now let $e_1, e_2, \ldots, e_m$ be a sequence of edges on the
graph connecting the vertices $v_1$ and $v_2$. We put $w_1, \ldots, w_m$ to be
the corresponding weights and let $g_0, g_1, \ldots, g_m$ be the values of $f$
corresponding to the consecutive vertices of that sequence. Now let
$h_i = g_i - g_{i-1}$ be the differences of values of $f$ along that path.
We have $\sum_i h_i = g_m - g_0 \ge K$.

Consider the minimum value $Z$ of $\sum_i w_i h_i^2$, given that
$\sum_i h_i \ge K$. Using Lagrange multipliers, we see that the solution is
given by $h_i = \alpha / w_i$. We find $\alpha$ using the condition
$\sum_i h_i = \alpha \sum_i \frac{1}{w_i} = K$. Therefore

$$Z = \frac{K^2}{\sum_i \frac{1}{w_i}}$$

Recall that $\frac{m}{\sum_i 1/w_i}$ is the harmonic mean of the numbers $w_i$
and is therefore no smaller than $\min(w_1, \ldots, w_m)$. Thus we obtain

$$\sum_i w_i h_i^2 \ge \min(w_1, \ldots, w_m) \frac{K^2}{m}$$

On the other hand, we see that

$$f^T L f = \sum_{i<j,\; i \sim j} w_{ij} (f_i - f_j)^2 \ge \sum_i w_i h_i^2$$

since the right-hand side of the inequality is a partial sum of the terms of the
left-hand side.

Hence

$$P(f) \ge \gamma \min(w_1, \ldots, w_m) \frac{K^2}{m}$$

Recalling that $P(f) \le M^2$, we finally obtain:

$$K \le \frac{M \sqrt{m}}{\sqrt{\gamma \min(w_1, \ldots, w_m)}}$$

Since the path between those points can be chosen arbitrarily, we can choose it
so that the length of the path $m$ does not exceed the unweighted diameter $D$
of the graph, which proves the proposition.

In particular, if all weights of $G$ are either zero or one, we have:

$$K \le M \sqrt{\frac{D}{\gamma}}$$

assuming, of course, that $G$ is connected.



To prove the main theorem we will use a result of Bousquet and Elisseeff
([3]). First we need the following

Definition 3. A learning algorithm is said to be uniformly (or algorithmically)
$\beta$-stable, if for any two training sets $T_1$, $T_2$ different at no more
than one point,

$$\forall x \quad |f_{T_1}(x) - f_{T_2}(x)| \le \beta$$

The stability condition can be thought of as the Lipschitz property for maps
from the set of training samples endowed with the Hamming distance into
$L^\infty(V)$.

Theorem 4 (Bousquet, Elisseeff). For a $\beta$-stable algorithm $T \to f_T$ we
have:

$$\forall \epsilon > 0 \quad \mathrm{Prob}\left( |R_k(f_T) - R(f_T)| > \epsilon + \beta \right) \le 2 \exp\left( - \frac{k \epsilon^2}{2 \left( k\beta + (K + M)^2 \right)^2} \right)$$

The above theorem¹, together with the appropriate stability of the graph
regularization algorithm, yields Theorem 1. We now proceed to show that
regularization on graphs using the smoothness functional $S$ is $\beta$-stable,
with $\beta$ as in Theorem 1.

Theorem 5 (Stability of Regularization on Graphs). For data samples
of size $k \ge 4$ with multiplicity of at most $t$, $\gamma$-regularization using
the smoothness functional $S$ is a
$\left( \frac{3M\sqrt{tk}}{(k\gamma\lambda_1 - t)^2} + \frac{4M}{k\gamma\lambda_1 - t} \right)$-stable
algorithm, assuming that the denominator $k\gamma\lambda_1 - t$ is positive.

Proof. Let $H$ be the hyperplane orthogonal to the vector
$\mathbf{1} = (1, \ldots, 1)$. We will denote by $P_H$ the operator
corresponding to the orthogonal projection on $H$. Recall that the solution to
the regularization problem is given by

$$(k\gamma S + I_k) f = \tilde{y} + \mu \mathbf{1}$$

where $\mu$ is chosen so that $f$ belongs to $H$. We order the graph so that the
labeled points come first. Then the diagonal matrix $I_k$ can be written as

$$I_k = \mathrm{diag}(n_1, \ldots, n_l, 0, \ldots, 0)$$

where $l$ is the number of distinct labeled vertices of the graph and
$n_i \le t$ is the multiplicity of the $i$th data point. The spectral radius of
$I_k$ is $\max(n_1, \ldots, n_l)$ and is therefore no greater than $t$. Note
that $l \le k$.

On the other hand, the smallest eigenvalue of $S$ restricted to $H$ is
$\lambda_1$. Noticing that $H$ is invariant under $S$ and that for any vector
$v$, $\|P_H(v)\| \le \|v\|$, since $P_H$ is an orthogonal projection operator,
and using the triangle inequality, we immediately obtain that for any $f \in H$

$$\|P_H (k\gamma S + I_k) f\| \ge \|P_H k\gamma S f\| - \|P_H I_k f\| \ge (\lambda_1 \gamma k - t) \|f\|$$

It follows that the spectral radius of the inverse operator
$(P_H(k\gamma S + I_k))^{-1}$ does not exceed $\frac{1}{k\gamma\lambda_1 - t}$
when restricted to $H$ (of course, the inverse is not even defined outside of
$H$).
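The linear system recalled in this proof can be solved directly on small graphs.
The sketch below is one possible way to do so, under several assumptions that
are not stated in this passage: the system is taken to be
$(k\gamma S + I_k) f = \tilde{y} + \mu\mathbf{1}$, $\tilde{y}$ is taken to hold,
for each labeled vertex, the sum of its observed labels after subtracting the
overall label mean, and the mean-zero constraint on $f$ is enforced by adding
$\mu$ as an extra unknown. The helper names and the dense pure-Python solver are
illustrative only; a sparse solver would be the practical choice on large
graphs.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def tikhonov_on_graph(S, labels, gamma):
    """Return (f, mu) for the regularization system on a graph.

    S      : n x n smoothness matrix (e.g. a graph Laplacian) as nested lists
    labels : dict mapping vertex index -> list of observed label values
    gamma  : regularization parameter
    """
    n = len(S)
    k = sum(len(ys) for ys in labels.values())          # number of labeled examples
    ybar = sum(sum(ys) for ys in labels.values()) / k   # mean of all labels
    # Unknowns are (f_1, ..., f_n, mu); the last row enforces sum(f) = 0.
    A = [[k * gamma * S[i][j] for j in range(n)] + [-1.0] for i in range(n)]
    rhs = [0.0] * n
    for v, ys in labels.items():
        A[v][v] += len(ys)                     # I_k: multiplicity on the diagonal
        rhs[v] = sum(y - ybar for y in ys)     # centered label sum (assumed y-tilde)
    A.append([1.0] * n + [0.0])
    rhs.append(0.0)
    sol = solve_linear(A, rhs)
    return sol[:n], sol[n]


if __name__ == "__main__":
    # Path graph on 4 vertices with the two end vertices labeled +1 and -1.
    L = [[1, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 1]]
    f, mu = tikhonov_on_graph(L, {0: [1.0], 3: [-1.0]}, gamma=0.1)
    print(f, mu)
```

On the symmetric path example the solution is antisymmetric about the middle of
the path, which gives a quick sanity check on any implementation.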

To demonstrate stability we need to show that the output of the algorithm
does not change much when we change the input at exactly one data point.
Suppose that $y$, $y'$ are the data vectors different in at most one entry. We
can assume that $y'$ contains a new point. The other case, when only the
multiplicities differ, follows easily from the same considerations. Thus we
write:

$$y = \Big( \sum_i y_{1i}, \; \sum_i y_{2i}, \; \ldots, \; \sum_i y_{li}, \; 0, \ldots, 0 \Big)$$

$$y' = \Big( \sum_i y_{1i}, \; \sum_i y_{2i}, \; \ldots, \; \sum_i y_{li}, \; y'_{l+1}, \; 0, \ldots, 0 \Big)$$

The sums are taken over all values of $y$ corresponding to a node on a graph.
The last sum contains one fewer term than the corresponding sum for $y$.

Put $\bar{y}$, $\bar{y}'$ to be the averages for $y$, $y'$ respectively. We note
that $|\bar{y} - \bar{y}'| \le \frac{2M}{k}$ and that the entries of
$\tilde{y}$, $\tilde{y}'$ differ by no more than that except for the last two
entries, which differ by at most $2M + \frac{2M}{k}$. Of course, the last
$n - l - 1$ entries of both vectors are equal to zero. Therefore

$$\|\tilde{y} - \tilde{y}'\| \le \sqrt{2 \Big( 2M + \frac{2M}{k} \Big)^2 + k \Big( \frac{2M}{k} \Big)^2} \le 4M$$

assuming that $k \ge 4$.

¹ Which is, actually, a special case of the original theorem, when the cost
function is quadratic.

The solutions to the regularization problem $f$, $f'$ are given by the equations

$$f = (P_H(k\gamma S + I_k))^{-1} \tilde{y}$$

$$f' = (P_H(k\gamma S + I'_k))^{-1} \tilde{y}'$$

where $I_k$ and $I'_k$ are $n \times n$ diagonal matrices,
$I_k = \mathrm{diag}(n_1, n_2, \ldots, n_l, 0, \ldots, 0)$,
$I'_k = \mathrm{diag}(n_1, n_2, \ldots, n_l - 1, 1, 0, \ldots, 0)$ and the
operators are restricted to the hyperplane $H$.

In order to ascertain stability, we need to estimate the maximum difference
between the entries of $f$ and $f'$, $\|f - f'\|_\infty$. We will use the fact
that $\|\cdot\|_\infty \le \|\cdot\|$. Put $A = P_H(k\gamma S + I_k)$,
$B = P_H(k\gamma S + I'_k)$ restricted to the hyperplane $H$. We have

$$f - f' = A^{-1}\tilde{y} - B^{-1}\tilde{y}' = A^{-1}(\tilde{y} - \tilde{y}') + A^{-1}\tilde{y}' - B^{-1}\tilde{y}'$$

Therefore

$$\|f - f'\|_\infty \le \|f - f'\| \le \|A^{-1}(\tilde{y} - \tilde{y}')\| + \|A^{-1}\tilde{y}' - B^{-1}\tilde{y}'\|$$

Since the spectral radius of $A^{-1}$ and $B^{-1}$ is at most
$\frac{1}{k\gamma\lambda_1 - t}$ and $\|\tilde{y} - \tilde{y}'\| \le 4M$,

$$\|A^{-1}(\tilde{y} - \tilde{y}')\| \le \frac{4M}{k\gamma\lambda_1 - t}$$



On the other hand, it can be checked that $\|\tilde{y}'\| \le 2\sqrt{tk}\,M$.
Indeed, it can be easily seen that the length is maximized when the multiplicity
of each point is exactly $t$. Noticing that the spectral radius of
$P_H(I'_k - I_k)$ cannot exceed $\sqrt{2} < 1.5$, we obtain:

$$\|A^{-1}\tilde{y}' - B^{-1}\tilde{y}'\| = \|B^{-1}(B - A)A^{-1}\tilde{y}'\| = \|B^{-1} P_H (I'_k - I_k) A^{-1} \tilde{y}'\| \le \frac{3M\sqrt{tk}}{(k\gamma\lambda_1 - t)^2}$$

Putting it all together

$$\|f - f'\|_\infty \le \frac{3M\sqrt{tk}}{(k\gamma\lambda_1 - t)^2} + \frac{4M}{k\gamma\lambda_1 - t}$$



Of course, we would typically expect 

However, one issue still remains unresolved: just how likely are we to have
multiple points in a sample? Having high multiplicities is quite unlikely as
long as $k \ll n$ and the distribution is reasonably close to uniform.

We make a step in this direction with the following simple combinatorial
estimate, which shows that for the uniform distribution on the graph, data
samples where points occur with high multiplicities (and, in fact, with any
multiplicity greater than 1) are unlikely as long as $k$ is relatively small
compared to $n$.

It would be easy to give a similar estimate for a more general distribution,
where the probability of each point is bounded from below by, say,
$\alpha/n$, $0 < \alpha < 1$.



Proposition 3. Assuming the uniform distribution on the graph, the probability
$P$ of a sample that contains some data point with multiplicity more than $t$
can be estimated as follows:

$$P \le \frac{2n}{(t+1)!} \left( \frac{k}{n} \right)^{t+1}$$



Proof. Let us first estimate the probability $P_l$ that the $l$th point will
occur more than $t$ times, when choosing $k$ points at random from a dataset of
$n$ points with replacement.

$$P_l = \sum_{i=t+1}^{k} \binom{k}{i} \left( \frac{1}{n} \right)^i \left( 1 - \frac{1}{n} \right)^{k-i}$$

Writing out the binomial coefficients and using an estimate via the sum of a
geometric progression yields:

$$P_l \le \sum_{i=t+1}^{k} \frac{1}{i!} \left( \frac{k}{n} \right)^i \le \frac{1}{(t+1)!} \left( \frac{k}{n} \right)^{t+1} \frac{1}{1 - \frac{k}{n}}$$

Assuming that $k \le \frac{n}{2}$, we finally obtain

$$P_l \le \frac{2}{(t+1)!} \left( \frac{k}{n} \right)^{t+1}$$

Applying the union bound, we see that the probability $P$ of some point being
chosen more than $t$ times is bounded as follows:

$$P \le \sum_{l=1}^{n} P_l \le \frac{2n}{(t+1)!} \left( \frac{k}{n} \right)^{t+1}$$

By rewriting $k$ in terms of the probability, we immediately obtain the
following

Corollary 6. With probability at least $1 - \epsilon$ the multiplicity of the
sample does not exceed $t$, given that
$k \le \left( \frac{(t+1)!\,\epsilon}{2} \right)^{\frac{1}{t+1}} n^{\frac{t}{t+1}}$.
In particular, the multiplicity of the sample is exactly 1 with probability at
least $1 - \epsilon$, as long as $k \le \sqrt{\epsilon n}$.
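The multiplicity estimate is easy to evaluate numerically. The sketch below
assumes the bound has the form $P \le \frac{2n}{(t+1)!}(k/n)^{t+1}$ for
$k \le n/2$, inverts it to get an admissible sample size for a target
probability $\epsilon$, and compares it with the exact probability of any
repeated point (the case $t = 1$). The function names are illustrative and not
taken from the text.

```python
import math


def multiplicity_bound(n, k, t):
    """Upper bound on P(some point occurs more than t times); needs k <= n/2."""
    if 2 * k > n:
        raise ValueError("the estimate assumes k <= n/2")
    return 2.0 * n / math.factorial(t + 1) * (k / n) ** (t + 1)


def max_sample_size(n, eps, t):
    """Largest integer k for which the bound above is at most eps (and k <= n/2)."""
    k = (math.factorial(t + 1) * eps / 2.0) ** (1.0 / (t + 1)) * n ** (t / (t + 1))
    return min(int(math.floor(k)), n // 2)


def exact_repeat_probability(n, k):
    """Exact P(some point is drawn more than once), uniform sampling with replacement."""
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (n - i) / n
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    n, k = 351, 10
    print("bound:", multiplicity_bound(n, k, t=1))
    print("exact:", exact_repeat_probability(n, k))
```

For $t = 1$ the bound reduces to $k^2/n$, so it is informative only when $k$ is
well below $\sqrt{n}$; the exact repeat probability is smaller, as expected for
a union-bound estimate.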

4 Experiments and Discussion 

An interesting aspect of the generalization bound derived in the previous
section is that it depends on certain geometric aspects of the graph. The size
of the graph seems relatively unimportant. For example, consider the edge graph
of a $d$-dimensional hypercube. Such a graph has $n = 2^d$ vertices. However,
the spectral gap is always $\lambda_1 = 2$. Thus the generalization bound on
such graphs is independent of the size $n$. For other kinds of graphs, it may be
the case that $\lambda_1$ depends weakly on $n$. For such graphs, we may hope
for good generalization from a small number of labeled examples relative to the
size of the graph.
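The hypercube claim can be checked directly for small $d$. The sketch below
builds the unweighted Laplacian of the $d$-cube and verifies that each of the
$2^d$ Walsh vectors $\chi_A(x) = (-1)^{|x \wedge A|}$ is an eigenvector with
eigenvalue $2|A|$; since these vectors are mutually orthogonal they account for
the whole spectrum, so the smallest nonzero eigenvalue is 2 regardless of $d$.
The function names are illustrative.

```python
def hypercube_laplacian(d):
    """Unweighted graph Laplacian of the d-dimensional hypercube (2^d vertices)."""
    n = 1 << d
    L = [[0] * n for _ in range(n)]
    for x in range(n):
        L[x][x] = d                      # every vertex has degree d
        for b in range(d):
            L[x][x ^ (1 << b)] = -1      # neighbor differs in exactly one bit
    return L


def walsh_vector(d, mask):
    """chi_mask(x) = (-1)^(number of common set bits of x and mask)."""
    return [(-1) ** bin(x & mask).count("1") for x in range(1 << d)]


def hypercube_spectrum(d):
    """Sorted Laplacian eigenvalues, each verified against its Walsh eigenvector."""
    L = hypercube_laplacian(d)
    n = 1 << d
    eigenvalues = []
    for mask in range(n):
        v = walsh_vector(d, mask)
        lam = 2 * bin(mask).count("1")
        Lv = [sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
        if any(Lv[i] != lam * v[i] for i in range(n)):
            raise ValueError("Walsh vector is not an eigenvector")
        eigenvalues.append(lam)
    return sorted(eigenvalues)


if __name__ == "__main__":
    for d in range(1, 6):
        spectrum = hypercube_spectrum(d)
        print(d, len(spectrum), spectrum[1])  # size 2^d, second smallest eigenvalue
```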

To evaluate the performance of our regularization algorithms and the insights 
from our theoretical analysis, we conducted a number of experiments. For ex- 
ample, our experimental results indicate that both Tikhonov and interpolated 
regularization schemes are generally competitive and often better than other 
semi-supervised algorithms. However, in this paper we do not discuss these per- 
formance comparisons. Instead, we focus on the performance of our algorithm 
and the usefulness of our bounds. 

We present results on two data sets of different sizes. 

4.1 Ionosphere Data Set 

The Ionosphere data set has 351 examples of two classes in a 34-dimensional
space. A graph is made by connecting nearby (6) points to each other. This
graph therefore has 351 vertices. We computed the value of the spectral gap of
this graph and the corresponding bound using different values of $\gamma$ for
different numbers of labeled points (see Table 4). We also computed the training
error (see Table 2), the test error (see Table 1), and the generalization gap
(see Table 3), to compare it with the value of the bound.
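One common way to build such a graph is a symmetrized nearest-neighbor
construction. The sketch below is a guess at that kind of procedure, not a
description of the authors' code: it assumes Euclidean distance, unit edge
weights, and that an edge is added whenever either endpoint lists the other
among its `k` nearest neighbors. It returns the unnormalized Laplacian
$L = D - W$.

```python
import math


def knn_graph_laplacian(points, k=6):
    """Laplacian of a symmetrized k-nearest-neighbor graph with 0/1 weights.

    points : list of equal-length tuples of floats
    k      : number of neighbors each point connects to (assumed value)
    """
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i, p in enumerate(points):
        # Sort the other points by distance; ties are broken by index.
        by_distance = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for _, j in by_distance[:k]:
            adj[i][j] = 1
            adj[j][i] = 1                # symmetrize: keep the edge in both directions
    L = [[-adj[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        L[i][i] = sum(adj[i])            # vertex degree on the diagonal
    return L


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
    for row in knn_graph_laplacian(pts, k=1):
        print(row)
```

Other choices (mutual neighbors only, or distance-based weights) would change
the Laplacian and hence $\lambda_1$, so the spectral gap reported for a data set
depends on these construction details.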

For $\gamma \ge 1$, the value of the bound is reasonable and the difference
between the training and the test error is small, as can be seen in the last
columns of these tables. However, both the training and the test error for
$\gamma = 1$ were high. In regimes where training and test errors were smaller,
we find that our bound becomes vacuous.

Table 1. Ionosphere data set. Classification error rates on the test set. #L is
the number of labeled examples.

#L     $\gamma = 0.001$   $\gamma = 0.01$   $\gamma = 0.1$   $\gamma = 1$
10     0.36               0.40              0.38             0.36
20     0.29               0.35              0.38             0.36
40     0.22               0.36              0.37             0.36
60     0.20               0.36              0.36             0.36
80     0.17               0.35              0.39             0.36
100    0.18               0.30              0.36             0.36
200    0.20               0.36              0.35             0.36
300    0.13               0.40              0.36             0.34

Table 2. Ionosphere data set. Classification error rates on the training set.
#L is the number of labeled examples.

#L     $\gamma = 0.001$   $\gamma = 0.01$   $\gamma = 0.1$   $\gamma = 1$
10     0.00               0.09              0.26             0.30
20     0.01               0.22              0.29             0.33
40     0.01               0.25              0.31             0.35
60     0.08               0.28              0.36             0.34
80     0.09               0.30              0.35             0.36
100    0.10               0.31              0.36             0.37
200    0.14               0.35              0.36             0.36
300    0.15               0.35              0.36             0.36

Table 3. Ionosphere data set. Difference between error rates on the test set
and on the training set.

#L     $\gamma = 0.001$   $\gamma = 0.01$   $\gamma = 0.1$   $\gamma = 1$
10     0.36               0.31              0.12             0.06
20     0.28               0.13              0.09             0.03
40     0.21               0.11              0.06             0.01
60     0.12               0.08              0.00             0.02
80     0.08               0.05              0.04             0.00
100    0.08               0.01              0.00             0.01
200    0.06               0.01              0.01             0.00
300    0.02               0.05              0.00             0.02

Table 4. Ionosphere data set, $\lambda_1 = 34.9907$. Generalization bound for
confidence $(1 - \delta)$, $\delta = 0.1$.

#L     $\gamma = 0.001$   $\gamma = 0.01$   $\gamma = 0.1$   $\gamma = 1$
10     173.59             32.87             2.92             1.16
20     1641.55            16.38             2.02             0.82
40     2138.57            9.73              1.40             0.58
60     469.07             7.44              1.14             0.47
80     251.67             6.22              0.98             0.41
100    173.02             5.43              0.87             0.36
200    72.72              3.64              0.61             0.26
300    48.97              2.90              0.50             0.21

4.2 MNIST Data Set

We also tested the performance of the regularization algorithm on the MNIST
data set. We used a training set with 11,800 examples corresponding to a
two-class problem with digits 8 and 9.



We computed the training and the test error as well as the bound for this
two-class problem. We report the results for the digits 8 and 9, averaged over
10 random splits. Table 5 and Table 6 show the error on the test and on the
training set, respectively. The regularization algorithm achieves a very low
error rate on this data set even with a small number of labeled points. The
difference between the training and the test error is shown in Table 7 and can
be compared to the value of the bound in Table 8.

Here again, we observe that the value of the bound is reasonable for
$\gamma = 0.1$ and $\gamma = 1$ but the test and training errors for these
values of $\gamma$ are rather high. Note, however, that with 2000 labeled
points, the error rate for $\gamma = 0.1$ is very similar to the error rates
achieved with smaller values of $\gamma$.

Interestingly, the regularization algorithm has very similar gaps between the
training and the test error for these two data sets although the number of
points in their graphs is very different (351 for the Ionosphere and 11,800 for
the MNIST two-class problem). The value of the smallest non-zero eigenvalue for
these two graphs is, however, similar. Therefore the similarity in the
generalization gaps is consistent with our analysis.

Table 5. MNIST data set, two-class classification problem for digits 8 and 9.
Classification error rates on the test set.

#L     $\gamma = 0.001$   $\gamma = 0.01$   $\gamma = 0.1$   $\gamma = 1$
20     0.04               0.03              0.45             0.50
40     0.02               0.03              0.42             0.40
100    0.02               0.03              0.37             0.40
200    0.02               0.02              0.28             0.41
400    0.02               0.02              0.09             0.46
800    0.02               0.02              0.11             0.44
2000   0.02               0.02              0.03             0.41

Table 6. MNIST data set, two-class classification problem for digits 8 and 9.
Classification error rates on the training set.

#L     $\gamma = 0.001$   $\gamma = 0.01$   $\gamma = 0.1$   $\gamma = 1$
20     0.00               0.01              0.33             0.40
40     0.00               0.01              0.36             0.36
100    0.01               0.02              0.32             0.38
200    0.02               0.02              0.24             0.39
400    0.02               0.02              0.09             0.45
800    0.02               0.02              0.10             0.42
2000   0.02               0.02              0.03             0.40

Table 7. MNIST data set, two-class classification problem for digits 8 and 9.
Difference between error rates on the test set and on the training set.

#L     $\gamma = 0.001$   $\gamma = 0.01$   $\gamma = 0.1$   $\gamma = 1$
20     0.04               0.02              0.12             0.10
40     0.02               0.02              0.06             0.04
100    0.01               0.01              0.05             0.02
200    0.00               0.00              0.04             0.02
400    0.00               0.00              0.00             0.01
800    0.00               0.00              0.01             0.02
2000   0.00               0.00              0.00             0.01

Table 8. MNIST data set, two-class classification problem for digits 8 and 9,
$\lambda_1 = 35.5460$. Generalization bound for confidence $(1 - \delta)$,
$\delta = 0.1$.

#L     $\gamma = 0.001$   $\gamma = 0.01$   $\gamma = 0.1$   $\gamma = 1$
20     1774.43            16.04             2.00             0.81
40     1928.94            9.55              1.39             0.57
100    166.74             5.34              0.87             0.36
200    70.69              3.58              0.61             0.26
400    37.13              2.44              0.43             0.18
800    21.60              1.69              0.30             0.13
2000   11.50              1.04              0.19             0.08



5 Conclusions 

In a number of different settings, the need arises to fill in the labels (values) of a 
partially labeled graph. We have provided a principled framework within which 
one can meaningfully formulate regularization for regression and classification on 
such graphs. Two different algorithms were then derived within this framework 
and have been shown to perform well on different data sets. 

The regularization framework offers several advantages. 

1. It eliminates the need for computing multiple eigenvectors or complicated 
graph invariants (min cut, max flow etc.). Unlike some previously proposed 
algorithms, we obtain a simple closed form solution for the optimal regressor. 
The problem is reduced to a single, usually sparse, linear system of equations 
whose solution can be computed efficiently. One of the algorithms proposed 
(interpolated regularization) is extremely simple with no free parameters. 

2. We are able to bound the generalization error and relate it to properties of 
the underlying graph using arguments from algorithmic stability. 

3. If the graph arises from the local connectivity of data obtained from
sampling an underlying manifold, then the approach has natural connections to
regularization on that manifold.

The experimental results presented here suggest that the approach has
empirical promise. Our future plans include more extensive experimental
comparisons and investigating potential applications to survey sampling and
other areas.

Perceptron-Like Performance for Intersections of Halfspaces

Given a set of examples on the unit ball in $\mathbb{R}^n$ which are labelled by
a halfspace $h$ which has margin $\rho$ (minimum Euclidean distance from any
point to the separating hyperplane), the well known Perceptron algorithm finds a
separating hyperplane. The Perceptron Convergence Theorem (see e.g. [2]) states
that at most $4/\rho^2$ iterations of the Perceptron update rule are required,
and thus the algorithm runs in time $O(n/\rho^2)$.
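For reference, the classical Perceptron update rule can be sketched as follows.
This is a minimal illustration, assuming a homogeneous halfspace (a separating
hyperplane through the origin) and labels in $\{-1, +1\}$; the helper
`circle_examples`, which places points on the unit circle and discards those
within distance `rho` of the target hyperplane, is a hypothetical data generator
introduced only for the demonstration.

```python
import math


def perceptron(examples, max_updates=100000):
    """Run the Perceptron update rule until every example is classified correctly.

    examples : list of (x, y) pairs, x a tuple of floats, y in {-1, +1}
    Returns the weight vector and the number of updates performed.
    """
    w = [0.0] * len(examples[0][0])
    updates = 0
    changed = True
    while changed:
        changed = False
        for x, y in examples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]      # w <- w + y x
                updates += 1
                changed = True
                if updates > max_updates:
                    raise RuntimeError("no separator found within max_updates")
    return w, updates


def circle_examples(num=50, rho=0.2):
    """Points on the unit circle labeled by sign(x_1), keeping margin at least rho."""
    examples = []
    for i in range(num):
        angle = 2.0 * math.pi * i / num
        x = (math.cos(angle), math.sin(angle))
        if abs(x[0]) >= rho:
            examples.append((x, 1 if x[0] > 0 else -1))
    return examples


if __name__ == "__main__":
    data = circle_examples()
    w, updates = perceptron(data)
    print("weights:", w, "updates:", updates)
```

The number of updates on such data stays within the $4/\rho^2$ figure quoted
above; the open problem asks for comparable behavior when the target is an
intersection of two halfspaces.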

Our question is the following: is it possible to give an algorithm which has
Perceptron-like performance, i.e. $\mathrm{poly}(n, 1/\rho)$ runtime, for
learning the intersection of two halfspaces with margin $\rho$? We say that a
concept $c$ has margin $\rho$ with respect to a set of points
$X \subset \mathbb{R}^n$ if

$$\rho = \min\{\|z - y\| : z \in X,\ y \in \mathbb{R}^n,\ c(z) \ne c(y)\} / \|X\|.$$

Here $\|X\|$ denotes $\max_{z \in X} \|z\|$. Note that for the case of a single
halfspace where all examples lie on the unit ball, this definition of margin is
simply the minimum Euclidean distance from any example to the separating
hyperplane as stated above.

The desired learning algorithm need not output an intersection of halfspaces
as its hypothesis; any reasonable hypothesis class (which gives an online or PAC
algorithm with the stated runtime) is fine.

Motivation: This is a natural restricted version of the more general problem
of learning an intersection of two arbitrary halfspaces with no condition on
the margin, which is a longstanding open question that seems quite hard (for
this more general problem no learning algorithm is known which runs in time
less than $2^{O(n)}$). Given the ubiquity of margin-based approaches for
learning a single halfspace, it is likely that a solution to the proposed
problem would be of significant practical as well as theoretical interest. As
described below, it seems plausible that a solution may be within reach.

Current status: The first work on this question is by Arriaga and Vempala [1],
who gave an algorithm that runs in time
$n \cdot \mathrm{poly}(1/\rho) + (1/\rho)^{\ldots}$, i.e. polynomial in $n$ but
exponential in $1/\rho$. Their algorithm randomly projects the examples to a
low-dimensional space and uses brute-force search to find a consistent
intersection of halfspaces. Recently we gave an algorithm [3] that runs in time
$n \cdot (1/\rho)^{O(\log(1/\rho))}$, i.e. polynomial in $n$ and quasipolynomial
in $1/\rho$. Our algorithm also uses random projection as a first step, but then
runs the kernel Perceptron algorithm with the polynomial kernel to find a
consistent hypothesis as opposed to using brute-force search. We show that low
degree polynomial threshold functions can correctly compute intersections of
halfspaces with a margin (in a certain technical sense; see [3] for details);
this implies that the degree of the polynomial kernel can be taken to be
logarithmic in $1/\rho$, which yields our quasipolynomial runtime dependence on
$\rho$. Can this quasipolynomial dependence on the margin $\rho$ be reduced to a
polynomial?

The Optimal PAC Algorithm

Assume we are trying to learn a concept class $C$ of VC dimension $d$ with
respect to an arbitrary distribution. There is a PAC sample size bound that
holds for any algorithm that always predicts with some consistent concept in the
class $C$ (BEHW89):
$O(\frac{1}{\epsilon}(d \log \frac{1}{\epsilon} + \log \frac{1}{\delta}))$,
where $\epsilon$ and $\delta$ are the accuracy and confidence parameters. Thus
after drawing this many examples (consistent with any concept in $C$), with
probability at least $1 - \delta$ the error of the produced concept is at most
$\epsilon$. Here the examples are drawn with respect to an arbitrary but fixed
distribution $D$, and the accuracy is measured with respect to the same
distribution. There is also a lower bound that holds for any algorithm
(EHKV89): $\Omega(\frac{1}{\epsilon}(d + \log \frac{1}{\delta}))$. It means that
at least this many examples are required for any algorithm to achieve error at
most $\epsilon$ with probability at least $1 - \delta$. The lower bound is
realized by distributions on a fixed shattered set of size $d$.

Conjecture: The one-inclusion graph algorithm of HLW94 always achieves
the lower bound. That is, after receiving
$O(\frac{1}{\epsilon}(d + \log \frac{1}{\delta}))$ examples, its error is at
most $\epsilon$ with probability at least $1 - \delta$.

The one-inclusion graph for a set of $t + 1$ unlabeled examples uses the
following subset of the $(t+1)$-dimensional hypercube as its vertex set: all bit
patterns in $\{0, 1\}^{t+1}$ produced by labeling the $t + 1$ examples with a
concept in $C$. There is an edge between two patterns if they are adjacent in
the hypercube (i.e. Hamming distance one).




An orientation of a one-inclusion graph is an orientation of its edges so that
the maximum out-degree of all the vertices is minimized. In HLW94 it is shown
how to do this using a network flow argument. The minimum maximum out-degree
can be shown to be at most $d$, the VC dimension of $C$.

The one-inclusion graph algorithm is formulated as a prediction algorithm:
When given $t$ examples labeled with a concept in $C$ and one more unlabeled
example, the algorithm produces a binary prediction on the unlabeled example.¹
How does this algorithm predict? It creates and orients the one-inclusion graph
for all $t + 1$ examples. If there is a unique extension of the $t$ labeled
examples to a labeling of the last example, then the one-inclusion graph
algorithm predicts with that labeling. However, if there are two labels possible
for the unlabeled example (i.e. the unlabeled example corresponds to an edge),
then the algorithm predicts with the label of the bit pattern at the head of the
oriented edge.
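The construction and prediction rule described above can be sketched for tiny
concept classes. The code below is illustrative only: it replaces the network
flow argument of HLW94 with a brute-force search over all $2^{|E|}$ edge
orientations, which is feasible only for very small graphs, and it assumes the
unlabeled example is the last of the $t + 1$ points. All function names are
hypothetical.

```python
from itertools import product


def one_inclusion_graph(concepts, points):
    """Vertices: labelings of `points` induced by `concepts`; edges: Hamming distance 1."""
    vertices = sorted({tuple(c(x) for x in points) for c in concepts})
    vertex_set = set(vertices)
    edges = []
    for v in vertices:
        for i in range(len(points)):
            if v[i] == 0:
                u = v[:i] + (1,) + v[i + 1:]   # flip bit i from 0 to 1
                if u in vertex_set:
                    edges.append((v, u))
    return vertices, edges


def best_orientation(vertices, edges):
    """Brute force: return (oriented edges as (tail, head), minimal max out-degree)."""
    best, best_degree = None, None
    for choice in product((0, 1), repeat=len(edges)):
        out_degree = {v: 0 for v in vertices}
        oriented = []
        for (a, b), flip in zip(edges, choice):
            tail, head = (a, b) if flip == 0 else (b, a)
            out_degree[tail] += 1
            oriented.append((tail, head))
        degree = max(out_degree.values())
        if best_degree is None or degree < best_degree:
            best, best_degree = oriented, degree
    return best, best_degree


def predict(vertices, oriented, labeled_prefix):
    """Predict the label of the last point given the labels of the first t points."""
    candidates = [v for v in vertices if v[:-1] == tuple(labeled_prefix)]
    if len(candidates) == 1:
        return candidates[0][-1]               # unique consistent extension
    for tail, head in oriented:
        if {tail, head} == set(candidates):
            return head[-1]                    # label at the head of the oriented edge
    raise ValueError("labels are not consistent with any concept")


if __name__ == "__main__":
    # Threshold concepts x >= theta on three points (VC dimension 1).
    thresholds = [lambda x, th=th: int(x >= th) for th in range(1, 5)]
    V, E = one_inclusion_graph(thresholds, [1, 2, 3])
    oriented, degree = best_orientation(V, E)
    print(len(V), len(E), degree, predict(V, oriented, (0, 1)))
```

For the threshold class the graph is a path, and an orientation with maximum
out-degree 1 exists, matching the bound by the VC dimension.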

The expected error² of the one-inclusion graph algorithm is at most
$\frac{d}{t+1}$ (HLW94), and it has been shown that this bound is within a
factor of $1 + o(1)$ of optimal (LLS02). On the other hand, predicting with an
arbitrary consistent hypothesis can lead to an expected error of
$\Omega(\frac{d}{t} \log \frac{t}{d})$ (HLW94). So in this open problem we
conjecture that the one-inclusion algorithm is also optimal in the PAC model.

For special cases of intersection-closed concept classes, the closure algorithm
has been shown to have the optimum
$O(\frac{1}{\epsilon}(d + \log \frac{1}{\delta}))$ bound (AO04). This algorithm
can be seen as an instantiation of the one-inclusion graph algorithm (the
closure algorithm predicts with an orientation of the one-inclusion graph with
maximum out-degree at most $d$). There are cases that show that the upper bound
of $O(\frac{1}{\epsilon}(d \log \frac{1}{\epsilon} + \log \frac{1}{\delta}))$
that holds for any algorithm that predicts with a consistent hypothesis cannot
be improved (e.g. AO04). However, all such cases that we are aware of seem to
predict with orientations of the one-inclusion graph that have unnecessarily
high out-degree.

¹ Prediction algorithms implicitly represent hypotheses. For any fixed set of
$t$ labeled examples, the predictions on the next unlabeled example define a
hypothesis. However, as for the algorithm discussed here, this hypothesis is
typically not in $C$.

² This is the same as the probability of predicting wrong on the unlabeled
example.

The following coins problem is a version of a multi-armed bandit problem where
one has to select from among a set of objects, say classifiers, after an
experimentation phase that is constrained by a time or cost budget. The question
is how to spend the budget. The problem involves pure exploration only,
differentiating it from typical multi-armed bandit problems involving an
exploration/exploitation tradeoff [BF85]. It is an abstraction of the following
scenarios: choosing from among a set of alternative treatments after a fixed
number of clinical trials; determining the best parameter settings for a program
given a deadline that only allows a fixed number of runs; or choosing a life
partner in the bachelor/bachelorette TV show where time is limited. We are
interested in the computational complexity of the coins problem and/or efficient
algorithms with approximation guarantees.

1 The Coins Problem 

We are given: 

- A collection of $n$ independent coins, indexed by the set $I$, where each coin
is specified by a probability density function (prior) over its head
probability. The priors of the different coins are independent, and they can be
different for different coins.

- A budget $b$ on the total number of coin flips.

We assume the tail and the head outcomes correspond to receiving no reward and a
fixed reward (1 unit) respectively. We are allowed a trial/learning period,
constrained by the budget, for the sole purpose of experimenting with the coins,
i.e., we do not collect rewards in this period. At the end of the period, we are
allowed to pick only a single coin for all our future flips (reward collection).

Let the actual head probability of coin $i$ be $\theta_i$. We define the regret
from picking coin $i$ to be $\theta^* - \theta_i$, where
$\theta^* = \max_{j \in I} \theta_j$. As we have the densities only, we
basically seek to make coin flip decisions and a final choice that lead to
minimizing our expected regret. It is easy to verify that when the budget is 0,
the choice of coin that minimizes expected regret is one with maximum expected
head probability over all the coins, i.e., $\arg\max_i E(\Theta_i)$, where
$\Theta_i$ denotes the random variable corresponding to the head probability of
coin $i$, and the expectation $E(\Theta_i)$ is taken over the density for coin
$i$.

A strategy is a prescription of which coin to flip given all the coins' flip
outcomes so far. A strategy may be viewed as a finite directed rooted tree,
where each node indicates a coin to flip, each edge indicates an outcome (heads
or tails), and the leaves indicate the coin to choose [MLG04]. No path length
from root to leaf exceeds the budget. Thus the set $S$ of such strategies is
finite. Associated with each leaf node $j$ is the (expected) regret $r_j$,
computed using the densities (one for each coin) at that node. Let $p_j$ be the
probability of “reaching” leaf $j$: $p_j$ is the product of the probabilities of
coin flip outcomes along the path from root to that leaf. We define the regret
of a strategy to be the expected regret, where the expectation is taken over the
coins' densities and the possible flip outcomes:
$\mathrm{Regret}(s) = \sum_{j \in \text{Tree Leafs of } s} p_j r_j$. The optimal
regret $r^*$ is then the minimum achievable (expected) regret and an optimal
strategy $s^*$ is one achieving it¹:

$$r^* = \min_{s \in S} \mathrm{Regret}(s), \qquad s^* = \arg\min_{s \in S} \mathrm{Regret}(s). \qquad (1)$$

We assume the budget is no larger than a polynomial in $n$, and that we can
represent the densities and update them (when the corresponding coin yields a
heads or tails outcome), and compute their expectation efficiently (e.g., the
family of beta densities). With these assumptions, the problem is in PSPACE
[MLG04].

Open Problem 1. Is computing the first action of an optimal strategy NP-hard? 



2 Discussion and Related Work 

We explore budgeted learning in [MLG04, LMG03]. We show that the coins problem
is NP-hard under non-identical coin flip costs and non-identical priors, by
reduction from the Knapsack problem. We present some evidence that the problem
remains difficult even under identical costs. We explore constant-ratio
approximability for strategies and algorithms²: an algorithm is a constant-ratio
approximation algorithm if its regret does not go above a constant multiple of
the minimum regret. We show that a number of algorithms such as round-robin and
greedy cannot be approximation algorithms. In the special case of identical
priors (and coin costs), we observe empirically that a simple algorithm we refer
to as biased-robin beats the other algorithms tested, and furthermore, its
regret is very close to the optimal regret on the limited range of problems for
which we could compute the optimal. Biased-robin sets $i = 1$, and continues
flipping coin $i$ until the outcome is tails, at which time it sets $i$ to
$(i \bmod n) + 1$, and repeats until the budget is exhausted. Note that
biased-robin doesn't take the budget into account except for stopping! An
interesting open problem is then:

Open Problem 2. Is biased-robin a constant-ratio approximation algorithm, for
identical priors and budget of $b = O(n)$?
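The biased-robin rule from the discussion above can be sketched in a few lines.
The sketch assumes identical Beta priors on every coin (the text mentions beta
densities only as an example family) and assumes the final choice is the coin
with the largest posterior mean, which mirrors the zero-budget rule of picking
the coin with maximum expected head probability. Coins are indexed from 0 here
rather than from 1, and `flip` is a caller-supplied function.

```python
def biased_robin(flip, n, budget, prior=(1.0, 1.0)):
    """Flip the current coin until it shows tails, then move on to the next coin.

    flip   : function taking a coin index and returning True for heads
    n      : number of coins
    budget : total number of flips allowed
    prior  : (alpha, beta) parameters of the Beta prior shared by all coins
    Returns the index of the chosen coin and the list of posterior means.
    """
    alpha = [prior[0]] * n
    beta = [prior[1]] * n
    i = 0
    for _ in range(budget):
        if flip(i):
            alpha[i] += 1.0          # heads: update the posterior and stay on coin i
        else:
            beta[i] += 1.0           # tails: update the posterior and advance
            i = (i + 1) % n
    means = [alpha[j] / (alpha[j] + beta[j]) for j in range(n)]
    chosen = max(range(n), key=lambda j: means[j])
    return chosen, means


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    head_probs = [0.3, 0.8, 0.5]     # hypothetical true head probabilities
    print(biased_robin(lambda i: rng.random() < head_probs[i], n=3, budget=12))
```

As the text notes, the rule ignores the size of the budget except as a stopping
condition, which is part of what makes its approximation ratio an interesting
question.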



¹ No randomized strategy has regret lower than the optimal deterministic
strategy [MLG04].

² An algorithm defines a strategy (for each problem instance) implicitly, by
indicating the next coin to flip [MLG04].